#bioinformatics #min-hash #metagenomics #jaccard #containment

bin lib jam-rs

Just another (genomic) minhash (Jam) implementation in Rust

2 releases

0.1.0-beta.2 Nov 22, 2023

#116 in Biology

MIT license

5.5MB
1.5K SLoC


jam-rs

Just another minhash (jam) implementation. A high-performance minhash variant for screening extremely large (metagenomic) datasets in a very short timeframe. Implements parts of the ScaledMinHash / FracMinHash algorithm described in sourmash.

Unlike traditional implementations such as sourmash or mash, this version focuses on estimating the containment of small sequences in large sets by (optionally) introducing an intentional bias towards smaller sequences. It is intended to screen terabytes of data in a few seconds to minutes.

Installation

A pre-release is published on crates.io. To install it, run the command below (you need cargo and the Rust toolchain installed; the easiest way to get them is via rustup.rs):

cargo install jam-rs

If you want the bleeding-edge development release, you can install it via git:

cargo install --git https://github.com/St4NNi/jam-rs

Comparison

  • Multiple algorithms: xxhash3, ahash-fallback (for kmer < 32) and legacy murmurhash3
  • No Jaccard similarity, since it is meaningless when comparing small embedded sequences against large sets
  • Additional filter and sketching options to increase specificity and sensitivity for small sequences in collections of large assembled metagenomes
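
To illustrate why Jaccard similarity is uninformative in this setting, here is a minimal sketch that compares both measures on a small query that is fully embedded in a large set. The function names and the toy hash values are assumptions made for illustration; this is not the jam-rs API.

```rust
use std::collections::HashSet;

/// Jaccard similarity: |A ∩ B| / |A ∪ B|.
fn jaccard(a: &HashSet<u64>, b: &HashSet<u64>) -> f64 {
    let union = a.union(b).count();
    if union == 0 {
        return 0.0;
    }
    a.intersection(b).count() as f64 / union as f64
}

/// Containment of `query` in `target`: |Q ∩ T| / |Q|.
fn containment(query: &HashSet<u64>, target: &HashSet<u64>) -> f64 {
    if query.is_empty() {
        return 0.0;
    }
    query.intersection(target).count() as f64 / query.len() as f64
}

fn main() {
    // Hypothetical hash values: a 10-hash query fully embedded in a 10,000-hash set.
    let target: HashSet<u64> = (0..10_000).collect();
    let query: HashSet<u64> = (0..10).collect();

    // The query is completely contained, yet Jaccard is dominated by the size of the target.
    println!("jaccard     = {:.4}", jaccard(&query, &target)); // prints 0.0010
    println!("containment = {:.4}", containment(&query, &target)); // prints 1.0000
}
```

Containment normalizes by the size of the query only, so the size of the database set does not dilute the score.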

Scaling methods

Multiple scaling methods are available:

  • FracMinHash (fscale): Restricts the hash space to a (lower) maximum of u64::MAX / fscale
  • KmerCountScaling (kscale): Restricts the overall maximum number of hashes to a fraction of 1/kscale of all k-mers; a kscale of 10 means 1/10th of all k-mers will be stored
  • MinMaxAbsoluteScaling (nscale): Restricts the minimum or maximum number of hashes per sequence record

If KmerCountScaling and MinMaxAbsoluteScaling are used together, the minimum number of hashes (per sequence record) is guaranteed. FracMinHash and KmerCountScaling produce similar results; the former is mainly provided for sourmash compatibility.
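
The three scaling methods can be sketched roughly as follows. This is an illustration under stated assumptions, not the jam-rs implementation: the helper names are hypothetical, and selecting the smallest hashes for kscale as well as the order in which the per-record minimum and maximum are applied are guesses based on the descriptions above.

```rust
/// FracMinHash-style filter: keep a hash only if it lies in the lowest
/// 1/fscale of the u64 hash space. `fscale` must be non-zero.
fn keep_fscale(hash: u64, fscale: u64) -> bool {
    hash <= u64::MAX / fscale
}

/// Number of hashes to keep for a record with `n_kmers` k-mers:
/// 1/kscale of all k-mers, clamped by the optional per-record bounds.
/// `kscale` must be non-zero. (Assumed semantics, see the lead-in.)
fn target_count(n_kmers: usize, kscale: usize, nmin: Option<usize>, nmax: Option<usize>) -> usize {
    let mut n = n_kmers / kscale;
    if let Some(min) = nmin {
        n = n.max(min); // guarantee the minimum per record
    }
    if let Some(max) = nmax {
        n = n.min(max); // enforce the top cut-off
    }
    n.min(n_kmers) // cannot keep more hashes than there are k-mers
}

/// Keep the `n` smallest distinct hashes (assumed selection rule).
fn bottom_n(mut hashes: Vec<u64>, n: usize) -> Vec<u64> {
    hashes.sort_unstable();
    hashes.dedup();
    hashes.truncate(n);
    hashes
}

fn main() {
    // fscale = 1000 keeps roughly 1 in 1000 hashes of a uniform hash function.
    println!("{}", keep_fscale(u64::MAX / 2000, 1000)); // true
    println!("{}", keep_fscale(u64::MAX, 1000)); // false

    // kscale = 10 on a short record, with a guaranteed minimum of 20 hashes.
    println!("{}", target_count(50, 10, Some(20), None)); // 20
    println!("{:?}", bottom_n(vec![5, 3, 9, 3, 1], 3)); // [1, 3, 5]
}
```

Under these assumptions, the minimum is what protects short records: with kscale alone, a 50 k-mer record would keep only 5 hashes.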

Usage

$ jam
Just another (genomic) minhasher (jam), obviously blazingly fast

Usage: jam [OPTIONS] <COMMAND>

Commands:
  sketch  Sketch one or more files and write result to output file (or stdout)
  merge   Merge multiple input sketches into a single sketch
  dist    Estimate distance of a (small) sketch against a subset of one or more sketches as database. Requires all sketches to have the same kmer size
  help    Print this message or the help of the given subcommand(s)

Options:
  -t, --threads <THREADS>  Number of threads to use [default: 1]
  -f, --force              Overwrite output files
  -h, --help               Print help (see more with '--help')
  -V, --version            Print version

Sketching

The easiest way to sketch files is the jam sketch command. It accepts one or more input files (fastx / fastx.gz) or a .list file containing a full list of input files, and sketches all inputs into a specified output sketch file.

$ jam sketch
Sketch one or more files and write the result to an output file (or stdout)

Usage: jam sketch [OPTIONS] [INPUT]...

Arguments:
  [INPUT]...  Input file(s), one directory or one file with list of files to be hashed

Options:
  -o, --output <OUTPUT>        Output file
  -k, --kmer-size <KMER_SIZE>  kmer size, all sketches must have the same size to be compared [default: 21]
      --fscale <FSCALE>        Scale the hash space to a minimum fraction of the maximum hash value (FracMinHash)
      --kscale <KSCALE>        Scale the hash space to a minimum fraction of all k-mers (SizeMinHash)
  -t, --threads <THREADS>      Number of threads to use [default: 1]
  -f, --force                  Overwrite output files
      --nmin <NMIN>            Minimum number of k-mers (per record) to be hashed, bottom cut-off
      --nmax <NMAX>            Maximum number of k-mers (per record) to be hashed, top cut-off
      --format <FORMAT>        Change to other output formats [default: bin] [possible values: bin, sourmash]
      --algorithm <ALGORITHM>  Change the hashing algorithm [default: default] [possible values: default, ahash, xxhash, murmur3]
      --singleton              Create a separate sketch for each sequence record
  -h, --help                   Print help

Dist

Calculate the distance for one or more inputs against a large set of database sketches. Optionally specify a minimum cutoff in percent of matching k-mers. The output file is optional; if it is not specified, the result is printed to stdout.

$ jam dist
Estimate containment of a (small) sketch against a subset of one or more sketches as database. Requires all sketches to have the same kmer size

Usage: jam dist [OPTIONS] --input <INPUT>

Options:
  -i, --input <INPUT>        Input sketch or raw file
  -d, --database <DATABASE>  Database sketch(es)
  -o, --output <OUTPUT>      Output to file instead of stdout
  -c, --cutoff <CUTOFF>      Cut-off value for similarity [default: 0.0]
  -t, --threads <THREADS>    Number of threads to use [default: 1]
  -f, --force                Overwrite output files
      --stats                Use the Stats params for restricting results
      --gc-lower <GC_LOWER>  Use GC stats with an upper bound of x% (gc_lower and gc_upper must be set)
      --gc-upper <GC_UPPER>  Use GC stats with an lower bound of y% (gc_lower and gc_upper must be set)
  -h, --help                 Print help

Merge

Merge multiple sketches into a single large sketch.
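
Conceptually, merging hash-based sketches amounts to taking the union of their hash sets. The sketch below shows that idea on plain vectors of hashes; it assumes all inputs were built with the same k-mer size, hash function and scaling, and the function name and data layout are hypothetical rather than the jam-rs file format or API.

```rust
use std::collections::BTreeSet;

/// Merge sketches by taking the union of their hashes.
/// The BTreeSet removes duplicates and yields the result in sorted order.
fn merge_sketches(sketches: &[Vec<u64>]) -> Vec<u64> {
    let merged: BTreeSet<u64> = sketches.iter().flatten().copied().collect();
    merged.into_iter().collect()
}

fn main() {
    // Two toy sketches that share the hash 7.
    let a = vec![3, 1, 7];
    let b = vec![7, 2];
    println!("{:?}", merge_sketches(&[a, b])); // [1, 2, 3, 7]
}
```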

$ jam merge
Merge multiple input sketches into a single sketch

Usage: jam merge [OPTIONS] --output <OUTPUT> [INPUTS]...

Arguments:
  [INPUTS]...  One or more input sketches

Options:
  -o, --output <OUTPUT>    Output file
  -t, --threads <THREADS>  Number of threads to use [default: 1]
  -f, --force              Overwrite output files
  -h, --help               Print help

License

This project is licensed under the MIT license. See the LICENSE file for more info.

Disclaimer

jam-rs is still in active development and not ready for production use. Use at your own risk.

Credits

This tool is heavily inspired by finch-rs (License) and sourmash (License). Check them out if you need a more mature ecosystem with well-tested hash functions and more features.
