ntHash

ntHash is a recursive hash function for hashing all possible k-mers in a DNA/RNA sequence.

ntHash2

Support for fast spaced seed hashing has been added to ntHash2. To try it out, check out the README on the development branch. ntHash2 will soon replace ntHash on the master branch.

Build the test suite

$ ./autogen.sh
$ ./configure
$ make
$ sudo make install

To install nttest in a specified directory:

$ ./autogen.sh
$ ./configure --prefix=/opt/ntHash/
$ make
$ make install

The nttest suite provides runtime and uniformity tests.

Runtime test

For the runtime test the program has the following options:

nttest [OPTIONS] ... [FILE]

Parameters:

-k, --kmer=SIZE: the length of k-mer used for runtime test hashing [50]
-h, --hash=SIZE: the number of generated hashes for each k-mer [1]
FILE: the input FASTA or FASTQ file

For example, to evaluate the runtime of different hash methods on the test file reads.fa in the DATA/ folder for k-mer length 50, run:

$ nttest -k50 reads.fa

Uniformity test

For the uniformity test using the Bloom filter data structure the program has the following options:

nttest --uniformity [OPTIONS] ... [REF_FILE] [QUERY_FILE]

Parameters:

-q, --qnum=SIZE: number of queries in query file
-l, --qlen=SIZE: length of reads in query file
-t, --tnum=SIZE: number of sequences in reference file
-g, --tlen=SIZE: length of reference sequence
-i, --input: generate random query and reference files
-j, --threads=SIZE: number of threads to run uniformity test [1]
REF_FILE: the reference file name
QUERY_FILE: the query file name

For example, to evaluate the uniformity of different hash methods using the Bloom filter data structure on randomly generated data sets with the following options:

100 genes of length 5,000,000bp as reference in file genes.fa
4,000,000 reads of length 250bp as query in file reads.fa
12 threads

run:

$ nttest --uniformity --input -q4000000 -l250 -t100 -g5000000 -j12 genes.fa reads.fa

Code samples

To hash all k-mers of length k in a given sequence seq:

    string kmer = seq.substr(0, k);
    uint64_t hVal=0;
    hVal = NTF64(kmer.c_str(), k); // initial hash value
    ...
    for (size_t i = 0; i < seq.length() - k; i++)
    {
        hVal = NTF64(hVal, seq[i], seq[i+k], k); // consecutive hash values
        ...
    }
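
The rolling update in the loop above can be illustrated with a simplified sketch. It is not the code in nthash.hpp: the per-base seed constants are arbitrary placeholders, and the names rol, baseSeed, hashDirect, hashRoll, and hashAllKmers are hypothetical. It only shows the general idea of a rotate-and-XOR recursive hash, where each new k-mer hash is derived from the previous one in constant time.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Rotate a 64-bit word left by r bits (r is reduced modulo 64).
inline uint64_t rol(uint64_t x, unsigned r) {
    r %= 64;
    return r == 0 ? x : (x << r) | (x >> (64 - r));
}

// Illustrative per-base seeds: arbitrary constants, NOT the values used by ntHash.
inline uint64_t baseSeed(char c) {
    switch (c) {
        case 'A': return 0x9E3779B97F4A7C15ULL;
        case 'C': return 0xC2B2AE3D27D4EB4FULL;
        case 'G': return 0x165667B19E3779F9ULL;
        case 'T': return 0x27D4EB2F165667C5ULL;
        default:  return 0; // other characters contribute nothing in this sketch
    }
}

// Hash the k-mer starting at pos from scratch: XOR of each base seed,
// rotated by the base's distance from the end of the k-mer.
uint64_t hashDirect(const std::string& seq, size_t pos, unsigned k) {
    uint64_t h = 0;
    for (unsigned i = 0; i < k; ++i)
        h ^= rol(baseSeed(seq[pos + i]), k - 1 - i);
    return h;
}

// Derive the next k-mer hash from the previous one in O(1):
// rotate once, cancel the outgoing base, add the incoming base.
uint64_t hashRoll(uint64_t prev, char charOut, char charIn, unsigned k) {
    return rol(prev, 1) ^ rol(baseSeed(charOut), k) ^ baseSeed(charIn);
}

// Hash every k-mer of seq using one direct hash followed by rolling updates.
std::vector<uint64_t> hashAllKmers(const std::string& seq, unsigned k) {
    assert(k > 0 && k <= seq.size());
    std::vector<uint64_t> hashes;
    uint64_t h = hashDirect(seq, 0, k);
    hashes.push_back(h);
    for (size_t i = 0; i + k < seq.size(); ++i) {
        h = hashRoll(h, seq[i], seq[i + k], k);
        hashes.push_back(h);
    }
    return hashes;
}
```

Calling hashAllKmers(seq, k) returns seq.length() - k + 1 values, one per k-mer, and each rolled value equals what hashDirect would compute for the same position.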

To compute canonical hash values for all k-mers of length k in a given sequence seq:

    string kmer = seq.substr(0, k);
    uint64_t hVal, fhVal=0, rhVal=0; // canonical, forward, and reverse-strand hash values
    hVal = NTC64(kmer.c_str(), k, fhVal, rhVal); // initial hash value
    ...
    for (size_t i = 0; i < seq.length() - k; i++)
    {
        hVal = NTC64(seq[i], seq[i+k], k, fhVal, rhVal); // consecutive hash values
        ...
    }
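
To make the idea of a canonical hash concrete, here is a small sketch. It assumes that "canonical" means a value that is the same for a k-mer and its reverse complement, obtained here by taking the smaller of the two strand hashes; the exact rule used by NTC64 should be checked in nthash.hpp. The hash is a stand-in (FNV-1a), not ntHash, and the names complement, reverseComplement, fnv1a, and canonicalHash are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>

// Complement of a single DNA base; other characters are returned unchanged.
inline char complement(char c) {
    switch (c) {
        case 'A': return 'T';
        case 'C': return 'G';
        case 'G': return 'C';
        case 'T': return 'A';
        default:  return c;
    }
}

// Reverse complement of a k-mer: reverse the string, then complement each base.
std::string reverseComplement(const std::string& kmer) {
    std::string rc(kmer.rbegin(), kmer.rend());
    for (char& c : rc) c = complement(c);
    return rc;
}

// Stand-in 64-bit string hash (FNV-1a), used only for illustration.
uint64_t fnv1a(const std::string& s) {
    uint64_t h = 14695981039346656037ULL; // FNV offset basis
    for (unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ULL;            // FNV prime
    }
    return h;
}

// Strand-independent hash: the smaller of the forward and reverse-strand hashes
// (an assumption made for this sketch).
uint64_t canonicalHash(const std::string& kmer) {
    assert(!kmer.empty());
    return std::min(fnv1a(kmer), fnv1a(reverseComplement(kmer)));
}
```

With this definition, canonicalHash("GATTACA") and canonicalHash("TGTAATC") return the same value, so a read and its reverse complement map to the same set of k-mer hashes.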

To multi-hash with h hash values all k-mers of length k in a given sequence seq:

    string kmer = seq.substr(0, k);
    uint64_t hVec[h];
    NTM64(kmer.c_str(), k, h, hVec); // initial hash vector
    ...
    for (size_t i = 0; i < seq.length() - k; i++)
    {
        NTM64(seq[i], seq[i+k], k, h, hVec); // consecutive hash vectors
        ...
    }
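
A vector of h hash values per k-mer is what a Bloom filter, such as the one used in the uniformity test, typically consumes. The following is a minimal sketch of that use, not code from this repository: BloomFilter, deriveHashes, and mix64 are hypothetical names, and the extra hash values are derived with a splitmix64-style mixer rather than with the scheme NTM64 uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// splitmix64-style mixing step, used here as a stand-in for deriving extra hashes.
inline uint64_t mix64(uint64_t x) {
    x += 0x9E3779B97F4A7C15ULL;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
    return x ^ (x >> 31);
}

// Build a vector of h hash values from one base hash value.
// Element 0 is the base hash itself; the rest are mixed variants.
std::vector<uint64_t> deriveHashes(uint64_t base, unsigned h) {
    std::vector<uint64_t> hVec(h);
    for (unsigned i = 0; i < h; ++i)
        hVec[i] = (i == 0) ? base : mix64(base + i);
    return hVec;
}

// Minimal Bloom filter: each hash value selects one bit.
class BloomFilter {
public:
    explicit BloomFilter(size_t numBits) : bits_(numBits, false) {
        assert(numBits > 0);
    }

    // Set the bit selected by every hash value of the element.
    void insert(const std::vector<uint64_t>& hVec) {
        for (uint64_t hv : hVec) bits_[hv % bits_.size()] = true;
    }

    // True if all selected bits are set (may be a false positive),
    // false if the element was definitely never inserted.
    bool contains(const std::vector<uint64_t>& hVec) const {
        for (uint64_t hv : hVec)
            if (!bits_[hv % bits_.size()]) return false;
        return true;
    }

private:
    std::vector<bool> bits_;
};
```

In real use, the hash vector would come from NTM64 or ntHashIterator for each k-mer, and the filter would be queried with the hash vectors of the query k-mers. The more uniformly the hash values spread over the bit array, the closer the observed false positive rate stays to the theoretical one.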

ntHashIterator

Enables ntHash on sequences

To hash all k-mers of length k in a given sequence seq with h hash values using ntHashIterator:

    ntHashIterator itr(seq, h, k);
    while (itr != itr.end())
    {
        ... use *itr ...
        ++itr;
    }

Usage example (C++)

Outputting hash values of all k-mers in a sequence

    #include <iostream>
    #include <string>
    #include "ntHashIterator.hpp"

    int main(int argc, const char* argv[])
    {
        /* test sequence */
        std::string seq = "GAGTGTCAAACATTCAGACAACAGCAGGGGTGCTCTGGAATCCTATGTGAGGAACAAACATTCAGGCCACAGTAG";

        /* k is the k-mer length */
        unsigned k = 70;

        /* h is the number of hashes for each k-mer */
        unsigned h = 1;

        /* init ntHash state and compute hash values for first k-mer */
        ntHashIterator itr(seq, h, k);
        while (itr != itr.end()) {
            /* print the first hash value of the current k-mer, then advance */
            std::cout << (*itr)[0] << std::endl;
            ++itr;
        }

        return 0;
    }

Publications

ntHash

Hamid Mohamadi, Justin Chu, Benjamin P Vandervalk, and Inanc Birol. ntHash: recursive nucleotide hashing. Bioinformatics (2016) 32 (22): 3492-3494. doi:10.1093/bioinformatics/btw397

Acknowledgements

This project uses:

CATCH unit test framework for C/C++
