ntHash is a recursive hash function for hashing all possible k-mers in a DNA/RNA sequence.
Support for fast spaced seed hashing has been added to ntHash2. To try it out, check out the README on the development branch. ntHash2 will soon replace ntHash on the master branch.
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install
To install nttest in a specified directory:
$ ./autogen.sh
$ ./configure --prefix=/opt/ntHash/
$ make
$ make install
The nttest suite provides runtime and uniformity tests.
For the runtime test, the program has the following options:
nttest [OPTIONS] ... [FILE]
Parameters:
-k, --kmer=SIZE: the length of k-mer used for runtime test hashing [50]
-h, --hash=SIZE: the number of generated hashes for each k-mer [1]
FILE: the input FASTA or FASTQ file
For example, to evaluate the runtime of different hash methods on the test file reads.fa in the DATA/ folder for k-mer length 50, run:
$ nttest -k50 reads.fa
For the uniformity test using the Bloom filter data structure the program has the following options:
nttest --uniformity [OPTIONS] ... [REF_FILE] [QUERY_FILE]
Parameters:
-q, --qnum=SIZE: number of queries in query file
-l, --qlen=SIZE: length of reads in query file
-t, --tnum=SIZE: number of sequences in reference file
-g, --tlen=SIZE: length of reference sequence
-i, --input: generate random query and reference files
-j, --threads=SIZE: number of threads to run uniformity test [1]
REF_FILE: the reference file name
QUERY_FILE: the query file name
For example, to evaluate the uniformity of different hash methods using the Bloom filter data structure on randomly generated data sets with the following options:
- 100 genes of length 5,000,000bp as reference in file genes.fa
- 4,000,000 reads of length 250bp as query in file reads.fa
- 12 threads
run:
$ nttest --uniformity --input -q4000000 -l250 -t100 -g5000000 -j12 genes.fa reads.fa
To hash all k-mers of length k in a given sequence seq:
string kmer = seq.substr(0, k);
uint64_t hVal=0;
hVal = NTF64(kmer.c_str(), k); // initial hash value
...
for (size_t i = 0; i < seq.length() - k; i++)
{
    hVal = NTF64(hVal, seq[i], seq[i+k], k); // consecutive hash values
    ...
}
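The recursive update used in the loop above can be illustrated without the library. The following is a minimal sketch, assuming a cyclic-polynomial style hash in which each base has a 64-bit seed, the first k-mer is hashed directly, and each later value is derived from the previous one by one rotation and two XORs. The names `baseSeed`, `rol`, `hashDirect`, `hashRoll`, and `rollingHashes`, as well as the seed constants, are illustrative assumptions and not the ntHash API.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Illustrative per-base seeds (arbitrary constants chosen for this sketch).
inline uint64_t baseSeed(char c) {
    switch (c) {
        case 'A': return 0x3c8bfbb395c60474ULL;
        case 'C': return 0x3193c18562a02b4cULL;
        case 'G': return 0x20323ed082572324ULL;
        case 'T': return 0x295549f54be24456ULL;
        default:  return 0; // any other character contributes nothing
    }
}

// Rotate a 64-bit word left by s bits (s is reduced modulo 64).
inline uint64_t rol(uint64_t v, unsigned s) {
    s &= 63;
    return s ? (v << s) | (v >> (64 - s)) : v;
}

// Hash the k-mer starting at pos from scratch: O(k) work.
uint64_t hashDirect(const std::string& seq, size_t pos, unsigned k) {
    uint64_t h = 0;
    for (unsigned j = 0; j < k; ++j)
        h ^= rol(baseSeed(seq[pos + j]), k - 1 - j);
    return h;
}

// Derive the next hash from the previous one: O(1) work.
// 'out' is the base leaving the window, 'in' is the base entering it.
uint64_t hashRoll(uint64_t prev, char out, char in, unsigned k) {
    return rol(prev, 1) ^ rol(baseSeed(out), k) ^ baseSeed(in);
}

// Hash every k-mer of seq, using the direct hash only for the first one.
std::vector<uint64_t> rollingHashes(const std::string& seq, unsigned k) {
    std::vector<uint64_t> hashes;
    if (k == 0 || seq.size() < k) return hashes;
    uint64_t h = hashDirect(seq, 0, k);
    hashes.push_back(h);
    for (size_t i = 0; i + k < seq.size(); ++i) {
        h = hashRoll(h, seq[i], seq[i + k], k);
        hashes.push_back(h);
    }
    return hashes;
}
```

Calling `rollingHashes(seq, k)` returns one value per k-mer, and each value should match `hashDirect(seq, i, k)` for the same position, which is a convenient way to check a rolling implementation.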
To compute canonical hash values for all k-mers of length k in a given sequence seq:
string kmer = seq.substr(0, k);
uint64_t hVal, fhVal=0, rhVal=0; // canonical, forward, and reverse-strand hash values
hVal = NTC64(kmer.c_str(), k, fhVal, rhVal); // initial hash value
...
for (size_t i = 0; i < seq.length() - k; i++)
{
    hVal = NTC64(seq[i], seq[i+k], k, fhVal, rhVal); // consecutive hash values
    ...
}
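A canonical hash gives the same value for a k-mer and its reverse complement. The sketch below shows one way this can work, assuming the forward-strand and reverse-strand hashes are both rolled in constant time and the smaller of the two is taken as the canonical value. All names (`baseSeed`, `complement`, `canonicalHashes`, `reverseComplement`) and the seed constants are assumptions made for illustration; this is not the library's implementation.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Illustrative per-base seeds (arbitrary constants chosen for this sketch).
inline uint64_t baseSeed(char c) {
    switch (c) {
        case 'A': return 0x3c8bfbb395c60474ULL;
        case 'C': return 0x3193c18562a02b4cULL;
        case 'G': return 0x20323ed082572324ULL;
        case 'T': return 0x295549f54be24456ULL;
        default:  return 0;
    }
}

// Watson-Crick complement; unknown characters map to 'N'.
inline char complement(char c) {
    switch (c) {
        case 'A': return 'T';
        case 'C': return 'G';
        case 'G': return 'C';
        case 'T': return 'A';
        default:  return 'N';
    }
}

inline uint64_t rol(uint64_t v, unsigned s) {
    s &= 63;
    return s ? (v << s) | (v >> (64 - s)) : v;
}

inline uint64_t ror1(uint64_t v) { return (v >> 1) | (v << 63); }

// Canonical hash of every k-mer: min(forward hash, reverse-complement hash).
std::vector<uint64_t> canonicalHashes(const std::string& seq, unsigned k) {
    std::vector<uint64_t> hashes;
    if (k == 0 || seq.size() < k) return hashes;
    uint64_t fh = 0, rh = 0;
    for (unsigned j = 0; j < k; ++j) {
        fh ^= rol(baseSeed(seq[j]), k - 1 - j);        // forward strand
        rh ^= rol(baseSeed(complement(seq[j])), j);    // reverse strand
    }
    hashes.push_back(std::min(fh, rh));
    for (size_t i = 0; i + k < seq.size(); ++i) {
        const char out = seq[i], in = seq[i + k];
        fh = rol(fh, 1) ^ rol(baseSeed(out), k) ^ baseSeed(in);
        rh = ror1(rh ^ baseSeed(complement(out)) ^ rol(baseSeed(complement(in)), k));
        hashes.push_back(std::min(fh, rh));
    }
    return hashes;
}

// Helper for checking strand independence.
std::string reverseComplement(const std::string& s) {
    std::string r(s.rbegin(), s.rend());
    for (char& c : r) c = complement(c);
    return r;
}
```

Hashing `seq` and `reverseComplement(seq)` with `canonicalHashes` should produce the same values in opposite order, since the k-mer at position i on one strand corresponds to the k-mer at position n-k-i on the other.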
To multi-hash all k-mers of length k in a given sequence seq with h hash values each:
string kmer = seq.substr(0, k);
uint64_t hVec[h];
NTM64(kmer.c_str(), k, h, hVec); // initial hash vector
...
for (size_t i = 0; i < seq.length() - k; i++)
{
    NTM64(seq[i], seq[i+k], k, h, hVec); // consecutive hash vectors
    ...
}
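Multi-hashing does not need h independent rolling hashes. One common approach, sketched below under stated assumptions, is to roll a single base hash and derive the remaining h-1 values from it with a cheap multiply and xor-shift mix. The function `extraHashes` and the constants `kMixSeed` and `kMixShift` are hypothetical and chosen for illustration; they should not be read as the values or the scheme used inside NTM64.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical mixing constants for this sketch.
const uint64_t kMixSeed = 0x90b45d39fb6da1faULL;
const unsigned kMixShift = 27;

// Derive h hash values from one base hash of a k-mer of length k.
// Index 0 is the base hash itself; the others are mixed variants of it.
std::vector<uint64_t> extraHashes(uint64_t base, unsigned k, unsigned h) {
    std::vector<uint64_t> hashes(h);
    if (h == 0) return hashes;
    hashes[0] = base;
    for (unsigned i = 1; i < h; ++i) {
        uint64_t t = base * (i ^ (k * kMixSeed)); // index-dependent multiplier
        t ^= t >> kMixShift;                      // fold high bits into low bits
        hashes[i] = t;
    }
    return hashes;
}
```

Because only the base hash is rolled, the cost per k-mer stays at one constant-time update plus h-1 mixes, independent of k.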
Enables ntHash on sequences
To hash all k-mers of length k in a given sequence seq with h hash values using ntHashIterator:
ntHashIterator itr(seq, h, k);
while (itr != itr.end())
{
    ... use *itr ...
    ++itr;
}
Outputting hash values of all k-mers in a sequence
#include <iostream>
#include <string>
#include "ntHashIterator.hpp"
int main(int argc, const char* argv[])
{
    /* test sequence */
    std::string seq = "GAGTGTCAAACATTCAGACAACAGCAGGGGTGCTCTGGAATCCTATGTGAGGAACAAACATTCAGGCCACAGTAG";
    /* k is the k-mer length */
    unsigned k = 70;
    /* h is the number of hashes for each k-mer */
    unsigned h = 1;
    /* init ntHash state and compute hash values for first k-mer */
    ntHashIterator itr(seq, h, k);
    /* print the first hash value of each k-mer, then advance to the next k-mer */
    while (itr != itr.end()) {
        std::cout << (*itr)[0] << std::endl;
        ++itr;
    }
    return 0;
}
Hamid Mohamadi, Justin Chu, Benjamin P Vandervalk, and Inanc Birol. ntHash: recursive nucleotide hashing. Bioinformatics (2016) 32 (22): 3492-3494. doi:10.1093/bioinformatics/btw397
This project uses:
- CATCH unit test framework for C/C++