This is the web-site for Vmatch, a versatile software tool for efficiently solving large scale sequence matching tasks. Vmatch subsumes the software tool REPuter, but is much more general, with a very flexible user interface, and improved space and time requirements. Here is a printable version of this HTML-page in PDF.
The Vmatch-manual gives many examples on how to use Vmatch. Here are the program’s most important features.
Usually, in a large scale matching problem, extensive portions of the sequences under consideration are static, i.e. they do not change much over time. Therefore it makes sense to preprocess this static data to extract information from it and to store this in a structured manner, allowing efficient searches. Vmatch does exactly this: it preprocesses a set of sequences into an index structure. This is stored as a collection of several files constituting the persistent index. The index efficiently represents all substrings of the preprocessed sequences and, unlike many other sequence comparison tools, allows matching tasks to be solved in time independent of the size of the index. Different matching tasks require different parts of the index, and only the required parts are accessed during the matching process.
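The idea of preprocessing once and then answering many queries against the index can be illustrated with a plain suffix array. This is only a minimal Python sketch and not Vmatch's implementation: the function names `build_suffix_array` and `find_occurrences` are hypothetical, the construction is the naive sort-based one, and the lookup is a binary search whose cost still grows logarithmically with the length of the text.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in lexicographic order.

    Naive construction by sorting; fine for illustration, too slow for genomes.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return the sorted start positions of pattern in text using binary search."""
    m = len(pattern)

    # Lower bound: first suffix whose prefix of length m is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose prefix of length m is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes in [first, lo) start with pattern.
    return sorted(suffix_array[first:lo])
```

The index (here the list returned by `build_suffix_array`) is computed once and could be stored; each later call to `find_occurrences` only reads it.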
Most software tools for sequence analysis are restricted to DNA and/or protein sequences. In contrast, Vmatch can process sequences over any user defined alphabet not larger than 250 symbols. Vmatch fully implements the concept of symbol mappings, denoting alphabet transformations. These allow the user to specify that different characters in the input sequences should be considered identical in the matching process. This feature is used to group similar amino acids, for example.
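A symbol mapping can be pictured as a lookup table applied to every character before matching. The following Python sketch is an illustration under stated assumptions: the helper names and the amino acid grouping in `GROUPS` are hypothetical and are not taken from Vmatch or its manual.

```python
def make_symbol_map(groups):
    """Map every character of each group to that group's first character."""
    mapping = {}
    for group in groups:
        representative = group[0]
        for symbol in group:
            mapping[symbol] = representative
    return mapping


def transform(sequence, mapping):
    """Apply the alphabet transformation; unmapped symbols are kept unchanged."""
    return "".join(mapping.get(symbol, symbol) for symbol in sequence)


# Hypothetical grouping of similar amino acids, for illustration only.
GROUPS = ["LVIM", "KR", "DE", "ST", "FYW"]
```

With this mapping, `transform("MKDS", make_symbol_map(GROUPS))` and `transform("IRET", make_symbol_map(GROUPS))` yield the same string, so the two peptides would be considered identical in a subsequent exact matching step.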
Vmatch allows a multitude of different matching tasks to be solved using the persistent index. Every matching task is basically characterized by (1) the kind of sequences to be matched, (2) the kind of matches sought, (3) additional constraints on the matches, and (4) the kind of postprocessing to be done with the matches.
In the standard case, Vmatch matches sequences over the same alphabet. Additionally, DNA sequences can be matched against a protein sequence index in all six reading frames. Finally, DNA sequences can be translated in all six reading frames and compared against themselves.
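The six reading frames are the three codon offsets on the forward strand plus the three offsets on the reverse complement. The Python sketch below shows this translation step only; it assumes the standard genetic code, and the function names are hypothetical rather than part of Vmatch.

```python
BASES = "TCAG"
# Standard genetic code, listed in TCAG order for first, second and third base.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate all complete codons; codons with unknown bases become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return the translations of forward frames 0-2, then reverse frames 0-2."""
    dna = dna.upper()
    strands = (dna, reverse_complement(dna))
    return [translate(strand[offset:]) for strand in strands for offset in range(3)]
```

Each of the six resulting protein strings can then be matched against a protein index in the same way as an ordinary protein sequence.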
Where appropriate, Vmatch can compute the following kinds of matches, using state-of-the-art algorithms:
To compute degenerate substring matches or degenerate repeats, each kind of match (with the exception of tandem repeats and complete matches) can be taken as an exact seed and extended by either of two different strategies:
S. Kurtz, J.V. Choudhuri, E. Ohlebusch, C. Schleiermacher, J. Stoye, and R. Giegerich. REPuter: The manifold applications of repeat analysis on a genomic scale. Nucleic Acids Res., 29(22):4633–4642, 2001 for repeat detection,
Matches can be selected according to their length, their E-value, their identity value, or match score.
In the standard case, a match is displayed as an alignment including positional information. Alternatively, a match can directly be postprocessed in different ways:
M.I. Abouelhoda and E. Ohlebusch. A Local Chaining Algorithm and its Applications in Comparative Genomics. In Proc. 3rd Worksh. Algorithms in Bioinformatics (WABI 2003), number 2812 in Lecture Notes in Bioinformatics, pages 1–16. Springer-Verlag, 2003
N. Volfovsky, B.J. Haas, and S.L. Salzberg. A Clustering Method for Repeat Analysis in DNA Sequences. Genome Biology, 2(8):research0027.1–0027.11, 2001
Vmatch is based on enhanced suffix arrays, described in Abouelhoda, Kurtz & Ohlebusch, 2004. This data structure has been shown to be as powerful as suffix trees, with the advantage of a reduced space requirement and reduced processing time. Careful implementation of the algorithms and data structures incorporated in Vmatch has led to exceedingly fast and robust software, allowing very large sequence sets to be processed quickly. The 32-bit version of Vmatch can process up to 400 million symbols, if enough memory is available. For large server class machines (e.g. SUN-Sparc/Solaris, Intel Xeon/Linux, Compaq-Alpha/Tru64) Vmatch is available as a 64-bit version, enabling gigabytes of sequences to be processed.
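One of the tables that enhance a suffix array is the table of longest common prefixes (lcp) between lexicographically adjacent suffixes. The Python sketch below builds such a table and uses it to report a longest repeated substring. It is a simplified illustration, not the data layout or code of Vmatch: the function names are hypothetical, the suffix array is built naively by sorting, and the further tables of a full enhanced suffix array are omitted.

```python
def suffix_array(text):
    """Naive suffix array: start positions of suffixes in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_table(text, sa):
    """lcp[r] is the length of the longest common prefix of the suffixes
    starting at sa[r - 1] and sa[r]; lcp[0] is 0.

    Linear-time computation in the style of Kasai et al.
    """
    n = len(text)
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1  # the next suffix shares at least h - 1 characters
        else:
            h = 0
    return lcp


def longest_repeat(text):
    """Return one longest substring that occurs at least twice in text."""
    if not text:
        return ""
    sa = suffix_array(text)
    lcp = lcp_table(text, sa)
    r = max(range(len(lcp)), key=lcp.__getitem__)
    return text[sa[r]:sa[r] + lcp[r]]
```

The two occurrences of a repeat may overlap, as in `longest_repeat("banana")`, which returns `"ana"`.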
The most common formats for input sequences (Fasta, Genbank, EMBL, and SWISSPROT) are accepted. The user does not have to specify the input format, as it is recognized automatically. All input files can contain an arbitrary number of sequences. Gzip-compressed inputs are accepted.
Vmatch’s output can be parsed by other programs easily. Furthermore, several options allow for its customization. XML output is available and new output formats can easily be incorporated without changing Vmatch’s program code. Certain matches can easily be selected by user defined criteria, without intermediate output and subsequent parsing.
Up until now we have referred to Vmatch as a collection of programs. In the following we use the same name, vmatch (in typewriter font), for the most important program in this collection. Besides vmatch, there are the following programs available:
Here is an overview of the dataflow in Vmatch.
There are several tools which are based on the persistent index of Vmatch:
J.V. Choudhuri, C. Schleiermacher, S. Kurtz, and R. Giegerich. Genalyzer: Interactive visualization of sequence similarities between entire genomes. Bioinformatics, 20:1964–1965, 2004
Genalyzer is not available any more.
M. Höhl, S. Kurtz, and E. Ohlebusch. Efficient multiple genome alignment. Bioinformatics, 18(Suppl. 1):S312–S320, 2002
E. Ohlebusch and S. Kurtz. Space efficient computation of rare maximal exact matches between multiple sequences. J. Comp. Biol., 15(4):357–377, 2008
Please contact Stefan Kurtz if you are interested in using Multimat.
M. Beckstette, R. Homann, R. Giegerich, and S. Kurtz. Fast index based algorithms and software for matching position specific scoring matrices. BMC Bioinformatics, 7:389, 2006
G. Gremme, V. Brendel, M.E. Sparks, and S. Kurtz. Engineering a software tool for gene prediction in higher organisms. Information and Software Technology, 47(15):965–978, 2005
We provide an annotated bibliography listing papers which applied Vmatch and briefly describe the tasks for which Vmatch was used. We omit our own papers. The references were collected by a search in Google Scholar (which, as of Jan 2, 2016, retrieved 397 results).
In this work Vmatch was used to compute a non-redundant set from a large collection of protein sequences from Zea-Maize.
Similar applications are described in
Q. Dong, L. Roy, M. Freeling, V. Walbot, and V. Brendel. ZmDB, an integrated Database for Maize Genome Research. Nucleic Acids Res., 31:244–247, 2003.
S. Dash, J. Van Hemert, L. Hong, R. P. Wise, and J. A. Dickerson. PLEXdb: gene expression resources for plants and plant pathogens. Nucleic Acids Res., 40(Database issue):D1194–1201, Jan 2012
PLEXdb provides a Vmatch-based web-service to match PLEXdb probes.
This work describes PlantGDB, which provides a service called PatternSearch@PlantGDB for genome wide pattern searches in plant sequences. The service is based on Vmatch.
In this work Vmatch was used for three different tasks:
M. Turmel, C. Otis, and C. Lemieux. The Chloroplast Genome Sequence of Chara vulgaris Sheds New Light into the Closest Green Algal Relatives of Land Plants. Molecular Biology and Evolution, 23:1324–1338, 2006
In these papers Vmatch was used to search and compare repeated elements in different chloroplast DNA.
In this work Vmatch was used to compare target genes of the tomato Chs RNAi to a tomato gene index.
In this work Vmatch was used to search different plant genomes for matches of length at least 20 with a maximum of 2 mismatches. Here the fact that Vmatch is an exhaustive search tool is important.
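Exhaustive means that every position satisfying the constraints is reported, not only those reachable from a heuristic seed. The brute-force Python sketch below makes this concrete for a fixed-length query and a mismatch threshold. The function name `hamming_matches` is hypothetical, and the quadratic scan is an illustration of the task, not of the index-based method Vmatch uses to solve it.

```python
def hamming_matches(text, query, max_mismatches):
    """Return (start, mismatches) for every position where query matches
    text with at most max_mismatches substitutions (no insertions/deletions)."""
    hits = []
    for start in range(len(text) - len(query) + 1):
        mismatches = 0
        for a, b in zip(text[start:start + len(query)], query):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many mismatches, give up on this position
        else:
            # Loop finished without break: the threshold was respected.
            hits.append((start, mismatches))
    return hits
```

For example, `hamming_matches("ACGTACGAACGT", "ACGT", 1)` reports the two exact occurrences and the one occurrence with a single mismatch.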
In this work Vmatch was used to determine the presence of shared repeated elements of minimum length 30, with up to 10% mismatches, in different sequence sets from the green alga Leptosira terrestris.
In this work Vmatch was used to map millions of short sequence reads to the A. thaliana genome. Up to four mismatches and up to three indels were allowed in the matching process. The seed size was chosen to be 0. The reads were aligned using the best match strategy by iteratively increasing the allowed number of mismatches and gaps at each round.
In this work Vmatch was used to map millions of short sequence reads to the A. thaliana genome. Vmatch was part of a multi-step pipeline, combining a fast matching algorithm (Vmatch) for initial read mapping and an optimal alignment algorithm based on dynamic programming (QPALMA) for high quality detection of splice sites.
In this work Vmatch was used for motif searching in different plant genomes.
In this work Vmatch was used to map unique consensus sequence tags to the maize reference genome.
In this work Vmatch was used to identify and cluster repeated sequences in the Floydiella chloroplast genome.
In this work Vmatch was used to calculate direct and reverse complementary matches of length 17 bp or greater with edit distance 1 or less between five nuclear chromosomes and mitochondrial and chloroplast genome sequences.
In this work Vmatch was used to search probe sequences against the maize genome and the cDNA sequences of the official maize gene models.
In this work Vmatch was used for clustering sequences assembled from 454-reads of Thellungiella parvula, a model for the evolution of plant adaptation to extreme environments.
In this work Vmatch was used for grouping short reads into pools representing the same RAD tag.
In this work Vmatch was used for detecting and clustering repetitive sequences in diverse fern plastid genomes.
In this work Vmatch was used to precisely define the boundaries of all repeats with 100% sequence identity.
In this work Vmatch was used to cluster sequences based on their six-frame translation.
In this work Vmatch was used to identify reciprocal best matches between the pigeonpea sequences and other legume sequences.
In this work Vmatch was used for assembly clustering and optimization of contigs for Neochloris oleoabundans (a Chlorophyceae class green microalgae).
In this work Vmatch was used to match reads against a repeat library to identify the content of the repetitive DNA per sequence read.
In this work Vmatch was used to align individual probes to representative gene models.
In this work Vmatch was used for performing exact searches with peptides against the filtered proteome of A. thaliana.
In this work Vmatch was used to map RNAseq reads, allowing up to two mismatches (option -h 2) and generating maximal substring matches that are unique in some reference dataset (option -mum cand).
In this work Vmatch was used to identify terminal inverted repeats with a length of 10-65 bp, ≥ 80% identity, and a maximum inter-TIR distance of 650 bp in genomes of epichloid fungal endophytes of grasses.
In this work Vmatch was used to match putative unique transcript sequence assemblies.
In this work Vmatch was used for refining assemblies of Illumina reads in the context of a transcriptome project for plant virus vector Graminella nigrifrons.
In this work Vmatch was used for clustering repeats and for building a consensus repeat library in the context of genome and transcriptome projects for Azadirachta indica, a medicinal and pesticidal angiosperm.
In this work Vmatch was used to map unique consensus sequence tags to the maize reference genome and to predict targets of novel miRNAs.
In this work Vmatch was used for masking Long Terminal Repeats in the Maize Genome Sequence.
P. Hernandez, M. Martis, G. Dorado, M. Pfeifer, S. Galvez, S. Schaaf, N. Jouve, H. Šimková, M. Valarik, J. Dolezel, and K. F. Mayer. Next-generation sequencing and syntenic integration of flow-sorted arms of wheat chromosome 4A exposes the chromosome structure and gene content. Plant J., 69(3):377–386, Feb 2012
R. Philippe, E. Paux, I. Bertin, P. Sourdille, F. Choulet, C. Laugier, H. Šimková, J. Šafář, A. Bellec, S. Vautrin, et al. A high density physical map of chromosome 1bl supports evolutionary studies, map-based cloning and sequencing in wheat. Genome Biol, 14(6):R64, 2013
Vmatch was used to mask repetitive DNA.
In this work Vmatch was used to cluster 40 010 assembled isotigs.
In this work Vmatch was used to preprocess short reads in the context of identifying microRNA targets in tomato fruit development.
In this work Vmatch was used in an all-vs-all comparison to bin contigs into loci based on a minimum of 200 bp sequence overlap in the context of transcriptome assembly for two Agave-species.
In this work Vmatch was used to align 454-reads to assembled isotigs for Ragweed pollen.
In this work Vmatch was used for comparing gene sets.
In this work Vmatch was used to detect repetitive DNA content of chromosomal survey sequences from the Rye genome.
D. Kopeckỳ, M. Martis, J. Číhalíková, E. Hřibová, J. Vrána, J. Bartoš, J. Kopecká, F. Cattonaro, Š. Stočes, Petr Novák, et al. Flow sorting and sequencing meadow fescue chromosome 4f. Plant Physiology, 163(3):1323–1337, 2013
D. Kopeckỳ, M Martis, J Číhalíková, E Hřibová, J Vrána, J Bartoš, et al. Genomics of meadow fescue chromosome 4f. Plant Physiol, 163:1323–1337, 2013
Vmatch was used for identifying repetitive DNA content in contigs of meadow fescue chromosome 4F assembled from Illumina short reads.
F. Jay, Y. Wang, A. Yu, L. Taconnat, S. Pelletier, V. Colot, J.-P. Renou, and O. Voinnet. Misregulation of AUXIN RESPONSE FACTOR 8 underlies the developmental abnormalities caused by three distinct viral silencing suppressors in Arabidopsis. PLoS Pathog, 7(5):e1002035–e1002035, 2011
X. Wang, D. Weigel, and L. M. Smith. Transposon variants and their effects on gene expression in arabidopsis. PLoS Genet, 9(2):e1003255, 2013
Vmatch was used for mapping siRNA sequences to the Arabidopsis thaliana genome.
In this work Vmatch was used for the identification of binding motifs.
In this work Vmatch was used for masking one sequence set with another and for mapping miRNA sequences of all plant species present in a reference database to the whole-genome assembly of Spirodela polyrhiza.
In this work Vmatch was used for repeat detection.
In this work Vmatch was used to eliminate redundancies in assemblies of Illumina reads in the context of studying plant defense mechanisms.
In this work Vmatch was used for clustering to determine a non-redundant set of assembled contigs.
In this work Vmatch was used for clustering sequences based on their RT and aRNH domain.
In this work Vmatch was used for identifying repeats in contigs assembled from 454-reads.
In this work Vmatch was used for identifying inverted repeats in chloroplast genomes.
In this work Vmatch was used to identify contaminations and repetitive elements by comparison of mRNA sequences to vector, bacterial and repeat databases.
In this work Vmatch was used to cluster contigs of different assemblies into groups of homologous sequences.
In this work Vmatch was used to identify inverted repeats in chloroplast genomes.
J.P. Fitch, S.N. Gardner, T.A. Kuczmarski, S. Kurtz, R. Myers, L.L. Ott, T.R. Slezak, E.A. Vitalis, A.T. Zemla, and P.M. McCready. Rapid development of nucleic acid diagnostics. Proceedings of the IEEE, 90(11):1708–1721, 2002
T. Slezak, T. Kuczmarski, L. Ott, C. Torres, D. Medeiros, J. Smith, B. Truitt, N. Mulakken, M. Lam, E. Vitalis, A. Zemla, C.E. Zhou, and S. Gardner. Comparative Genomics Tools Applied to Bioterrorism Defense. Briefings in Bioinformatics, 4(2):133–149, 2003
used Vmatch to detect unique substrings in large collections of DNA sequences. These unique substrings serve as signatures allowing for rapid and accurate diagnostics to identify pathogenic bacteria and viruses. A similar application is reported in S.N. Gardner, T.A. Kuczmarski, E.A. Vitalis, and T.R. Slezak. Limitations of TaqMan PCR for Detecting Viral Pathogens Illustrated by Hepatitis A, B, C, and E Viruses and Human Immunodeficiency Virus. J. of Clinical Microbiology, 41(6):2417–2427, 2003.
In this work Vmatch was used to map signature tags to the genome of S. meliloti.
I. Grissa, G. Vergnaud, and C. Pourcel. CRISPRFinder: a web tool to identify clustered regularly interspaced short palindromic repeats. Nucleic Acids Res, 35(Web Server issue):W52–7, 2007
I. Grissa, G. Vergnaud, and C. Pourcel. The CRISPRdb database and tools to display CRISPRs and to generate dictionaries of spacers and repeats. BMC Bioinformatics, 8:172, 2007
used Vmatch to efficiently find maximal repeats, as a first step in localizing clustered regularly interspaced short palindromic repeats (CRISPRs).
In this work Vmatch was used to map predicted sequences to information about Rho-independent terminators provided by a specific database.
In this work Vmatch was used to cluster DNA-sequences into families based on their six-frame translation.
In this work Vmatch was used to align 454-sequences to the E. coli genome and to cluster the sequences.
In this work Vmatch was used for detecting repeats in three bacterial species.
In this work Vmatch was used for masking repeats in 454-reads.
In this work Vmatch was used to identify distal primers.
In this work Vmatch was used for removing redundant transcripts assembled in an RNA-seq study based on Illumina reads for Heliothis virescens (tobacco budworm), infected with a virus.
In this work Vmatch was used to search unassembled Illumina reads of US and African strains of Xanthomonas oryzae for evidence of transcriptional activator-like effector sequences.
D. A. Hysom, P. Naraghi-Arani, M. Elsheikh, A. C. Carrillo, P. L. Williams, and S. N. Gardner. Skip the alignment: degenerate, multiplex primer and probe design using K-mer matching instead of alignments. PLoS ONE, 7(4):e34560, 2012
In this context Vmatch was used for selecting multiplex compatible, degenerate primers and probes to detect diverse targets such as viruses.
In this work Vmatch was used to identify redundant contigs from de novo exome assemblies.
In this work Vmatch was used to identify reads which have no common 20-mers with other reads in a context of a marine viral metagenome project.
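The task described here, finding reads that share no k-mer with any other read, can be sketched in a few lines of Python. The version below uses an in-memory dictionary from k-mers to read numbers; the function name is hypothetical, and this hash-based approach is a stand-in chosen for brevity, not the index-based procedure of the cited work.

```python
from collections import defaultdict


def reads_without_shared_kmers(reads, k):
    """Return the indices of reads that have no k-mer in common with any other read."""
    kmer_to_reads = defaultdict(set)
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            kmer_to_reads[read[i:i + k]].add(idx)

    # A read is "shared" if any of its k-mers also occurs in a different read.
    shared = set()
    for owners in kmer_to_reads.values():
        if len(owners) > 1:
            shared.update(owners)

    return [idx for idx in range(len(reads)) if idx not in shared]
```

Reads shorter than k contribute no k-mers and are therefore always reported by this sketch.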
In this work Vmatch was used for clustering potential complete Endogenous retroviruses of the bat Myotis lucifugus into subfamilies.
B. L. Hurwitz, A. H. Westveld, J. R. Brum, and M. B. Sullivan. Modeling ecological drivers in marine viral communities using comparative metagenomics and network analyses. Proc. Natl. Acad. Sci. U.S.A., 111(29):10714–10719, July 2014
B. L. Hurwitz, L. Deng, B. T. Poulos, and M. B. Sullivan. Evaluation of methods to concentrate and purify ocean virus communities through comparative, replicated metagenomics. Environ. Microbiol., 15(5):1428–1440, May 2013
J. R. Brum, B. L. Hurwitz, O. Schofield, H. W. Ducklow, and M. B. Sullivan. Seasonal time bombs: dominant temperate viruses affect southern ocean microbial dynamics. The ISME journal, 2015
Vmatch was used for k-mer analysis in the context of different marine metagenome projects.
In this work Vmatch was used for k-mer analysis in the context of microbial communities.
In this work Vmatch was used in an iterative scheme to construct contigs from reads associated with resistance genes in the context of a shotgun metagenome project.
In this work Vmatch was used to match probe candidate sequences against viral sequences and the human genome sequence.
In this work Vmatch was used to identify the species of the Streptococcaceae by comparison with the Silva 115 release 16S reference sequence database.
J. van Helden, A.F. Rios, and J. Collado-Vides. Discovering Regulatory Elements in Non-Coding Sequences by Analysis of Spaced Dyads. Nucleic Acids Res., 28(8):1808–1818, 2000
and developed by Jacques van Helden use Vmatch to purge sequences before computing sequence statistics. Similar applications are reported in the following papers:
R.J.M. Hulzink, H. Weerdesteyn, A.F. Croes, M.M.A. Gerats, T. van Herpen, and J. van Helden. In Silico Identification of Putative Regulatory Sequence Elements in the 5’-Untranslated Region of Genes That Are Expressed during Male Gametogenesis Gene Co-regulation. Plant Physiol., 132:75–83, 2003
N. Simonis, S.J. Wodak, G.N. Cohen, and J van Helden. Combining Pattern Discovery and Discriminant Analysis to Predict Gene Co-regulation. Bioinformatics, 20:2370–2379, 2004
N. Simonis, J. van Helden, G.N. Cohen, and S.J. Wodak. Transcriptional regulation of protein complexes in yeast. Genome Biology, 5:R33, 2004.
E. Coward, S.A. Haas, and M. Vingron. SpliceNest: Visualization of Gene Structure and Alternative Splicing Based on EST Clusters. Trends Genet., 18(1):53–55, 2002
computes gene indices and uses Vmatch to map clustered sequences to large genomes.
J. Krüger, A. Sczyrba, S. Kurtz, and R. Giegerich. e2g: An interactive web-based server for efficiently mapping large EST and cDNA sets to genomic sequences. Nucleic Acids Res., 32:W301–W304, 2004.
In this work Vmatch was used (1) to match 130 861 vector-trimmed sequences against the maize repeat database, and (2) to cluster near-identical sequences.
T. Dezulian, M. Schaefer, R. Wiese, D. Weigel, and D.H. Huson. CrossLink: visualization and exploration of sequence relationships between (micro) RNAs. Nucleic Acids Res., 34(Web Server Issue):W400–W404, 200
is a versatile computational tool which aids in visualizing relationships between RNA sequences (particularly between ncRNAs and their putative target transcripts) in an intuitive and accessible way. Besides BLAST, CrossLink uses Vmatch to reveal the sequence relationships to be visualized.
R. Arnold, T. Rattei, P. Tischler, M.-D. Truong, V. Stümpflen, and H.W. Mewes. SIMAP - The similarity matrix of proteins. Bioinformatics, 21(Suppl. 2):ii42–ii46, 2005
used Vmatch to locate the sequences in SIMAP which are similar to a given query. This is much faster than running BLAST.
In this work Vmatch was used to compute similarities between genomes, which are then visualized by the program DNAVis.
P.N. Seibel, J. Krüger, S. Hartmeier, K. Schwarzer, K. Löwenthal, H. Mersch, T. Dandekar, and R. Giegerich. XML schemas for common bioinformatic data types and their application in workflow systems. BMC Bioinformatics, 7:490, 2006
Seibel et al. describe methods for creating web-services and give examples which, among other tools, also integrate Vmatch.
J. Krumsiek, R. Arnold, and T. Rattei. Gepard: a rapid and sensitive tool for creating dotplots on genome scale. Bioinformatics, 23(8):1026–8, 2007
uses mkvtree to compute enhanced suffix arrays.
J. Martin, V. M. Bruno, Z. Fang, X. Meng, M. Blow, T. Zhang, G. Sherlock, M. Snyder, and Z. Wang. Rnnotator: an automated de novo transcriptome assembly pipeline from stranded RNA-Seq reads. BMC Genomics, 11:663, 2010
C. M. Lushbough, D. M. Jennewein, and V. Brendel. The bioextract server: a web-based bioinformatic workflow platform. Nucleic acids research, 39(suppl 2):W528–W532, 2011
uses Vmatch to remove duplicated sequences.
In this work Vmatch was used for removing duplicates in BlastP results. This use is part of a workflow in myexperiment.
In this work Vmatch was used for probe/primer search functionality in the probeBase database.
In this work Vmatch was used to reveal long repeats inside human chromosome 1 and long similar regions between human chromosome 1 and all other human chromosomes.
In this work Vmatch was used for vector screening.
In this work Vmatch was used for mapping short reads.
In this work Vmatch was used for matching reads to sets of RNA sequences and the Human genome.
In this work Vmatch was used to uniquely map miRNAs against the human genome.
In this work Vmatch was used to determine the positions of CAGE tags on the human genome.
In this work Vmatch was used to align sections of reads against RefSeq mRNA exon sequences.
In this work Vmatch was used to align sets of genes.
In this work Vmatch was used to determine the positions of CAGE tags on the human genome.
In this work Vmatch was used to cluster 317 242 EST and cDNA sequences from Xenopus laevis. Vmatch was chosen for the following reasons:
In this work Vmatch was used to cluster EST-sequences of Xenopus laevis.
In this work Vmatch was used to search exact repeats in the Macronuclear Genome Sequence of the Ciliate Tetrahymena thermophila.
In this work Vmatch was used for mapping
In this work Vmatch was used to search small RNA signatures in entire miRNA gene sequences for Arabidopsis and rice.
In this work Vmatch was used to map small RNA data sets onto the corresponding reference genomes for different model organisms.
In this work Vmatch was used for mapping Illumina reads to the mouse genome.
In this work Vmatch was used for redundancy removal in the context of transcriptome assembly of a keelworm species.
In this work Vmatch was used to remove redundant contigs in a genome project of four Aureobasidium pullulans varieties.
In this work Vmatch was used for merging assemblies of Illumina sequenced cDNA.
In this work Vmatch was used to combine and scaffold contigs.
Total number of usages: 108
Vmatch is available for download in executable form for the following platforms:
Vmatch has been developed since May 2000 by Stefan Kurtz, a professor of Computer Science at the Center for Bioinformatics, University of Hamburg, Germany.