Introduction
SOAPaligner/soap2 is a member of SOAP (Short Oligonucleotide Analysis Package). It is an updated version of the SOAP software for short oligonucleotide alignment. The new program offers very fast and accurate alignment of the huge numbers of short reads generated by the Illumina/Solexa Genome Analyzer. Compared to soap v1, it is one order of magnitude faster: it requires only 2 minutes to align one million single-end reads onto the human reference genome. Another notable improvement is that SOAPaligner now supports a wide range of read lengths.
SOAPaligner gains its time and space efficiency from a complete redesign of the basic data structures and algorithms. The core algorithms and the indexing data structures (2way-BWT) were developed by the algorithms research group of the Department of Computer Science, the University of Hong Kong (T.W. Lam, Alan Tam, Simon Wong, Edward Wu and S.M. Yiu).
System Requirements
1. Hardware:
a) 64-bit x86-64 CPUs with SSE instructions.
b) 8 GB main memory (for a genome as large as human’s).
c) 8 GB hard disk (for a genome as large as human’s).
2. Software:
a) 64-bit Linux system (kernel >=2.6).
Download
NOTE: Because of copyright restrictions on some parts of the source code, we cannot release the source code of the current version of SOAPaligner/soap2. If you want to use SOAPaligner/soap2 on other platforms, please feel free to contact us, stating your CPU architecture and OS kernel version. Because the data structure is incompatible with 32-bit systems, we will NOT provide a 32-bit version.
Release 2.21 , 02-14-2011
For GNU Linux X86_64 :
( MD5: 563b8b7235463b68413f9e841aa40779 )
Release 2.20 , 08-13-2009
For GNU Linux X86_64 :
( MD5: f9dc6fddbb2087959221447062c7ec6c )
SOAPaligner-v2.20-src :
(MD5: ca75753697b12749c42356f366738fef )
under (GNU/GPL v3)
SOAPaligner_builder :
(MD5: e130bf9d50d0b82604cba591c0f92796) Source with index builder.
For MAC OS X :
(MD5: 00134fe0bdf1c7ab1109b99a4cc09340 )
Compile environment:
System Version: Mac OS X 10.6.3 (10D573); Kernel Version: Darwin 10.3.0 x86_64 ; gcc version 4.2.1
New utilities for SOAP:
1 soap2sam.pl : a format converter.
2 soap.coverage : calculates sequencing coverage or physical coverage, as well as the duplication rate and details of specific blocks, for each segment and for the whole genome, using SOAP, BLAT, BLAST, BlastZ, mummer and MAQ alignment results, with multithreading.
soap.coverage : version 2.7.7 (MD5: 7cf98626e3573d680ed0e767207bfa95)
Release 2.19 , 07-13-2009
For GNU Linux X86_64 :
( MD5: f72210a472d3341c80c6c7aa0abecdf1 )
NOTE:
An additional build of SOAPaligner v2.19 that supports gzip I/O is also available:
SOAPaligner-v2.19-gz.tar.gz (MD5: 6f8b3503a990cc00e45c3bdb8eff5985 )
Release 2.18 , 05-25-2009
CHANGE:
1. Fixed a segmentation fault in gapped alignment and in multithreaded mode.
2. Fixed a bug where some start positions were < 0 or > chrLength.
3. The -l option is now compatible with differing read lengths.
4. -s: minimum length after soft clipping.
5. The real lengths of the sequence and quality strings are adjusted to match the soft clipping.
6. The MD field contains no 0 except as the first value.
For GNU Linux X86_64 :
( MD5: 36b24eb23aadde0d6dbed238cf5e58be )
Release 2.17 , 04-03-2009
For GNU Linux X86_64 :
( MD5: 3fc5fc80a90ef92a6db9644a452b4522 )
Release 2.16 , 03-31-2009
CHANGE: Fix segmentation fault when -r 2 is used.
For GNU Linux X86_64 :
( MD5: f6fdb463aa5b1d315625976de71540f4 )
Release 2.15 , 03-27-2009
CHANGE: Fix bugs in gapped alignment.
For GNU Linux X86_64 :
( MD5: 10ee28d3a00cb87fa131080f5b2e7232 )
Release 2.11 , 03-17-2009
CHANGE: Fix bugs.
For GNU Linux X86_64 :
( MD5: 5bfbc46584a56c3178499d0e45c8999c )
Thanks to all the users who tested the program and reported bugs, especially Shawn Cokus and David Casero Diaz-Cano at UCLA, Heng Li at the Sanger Institute and Junjie Qin at BGI Shenzhen.
Release 2.10 , 03-03-2009
CHANGE:
1. Allow more than 2 mismatches at 3'-ends when aligning long reads (>35 bp);
2. Add multithreading support.
For GNU Linux X86_64 :
( MD5: e5a984d62054a5c256efcab79e958a7f )
Release 2.01 , 11-24-2008
CHANGE: Fix some bugs.
For GNU Linux X86_64 :
( MD5: a78aa68373ae04525c5122b4b16e60d8 )
Release 2.01-Beta , 11-17-2008
For GNU Linux X86_64 : ( MD5: 97c3a7902bfd1340aea12e0638933095 )
Release 2.00 , 11-13-2008
For GNU Linux X86_64 : ( MD5: 37d7a2751fbe8c097abedf364a599f39 )
NOTE :
1. We now offer a sort tool (named "msort") for SOAPaligner: msort.tar.gz
2. All above releases for Linux were built on suse 11 64-bit with 2.6 kernel.
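Each release above is listed with an MD5 checksum, so a downloaded archive can be checked before it is unpacked. The following is a minimal sketch in Python using only the standard library; the function names md5_of_file and matches_published_md5 are illustrative and are not part of the SOAP package.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives are never loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path, published):
    """Compare a file's digest with a published checksum, ignoring case and outer spaces."""
    return md5_of_file(path) == published.strip().lower()
```

For example, matches_published_md5("SOAPaligner.tar.gz", "563b8b7235463b68413f9e841aa40779") would check a local copy against the checksum listed for release 2.21, assuming the archive was saved under that file name.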
Installation
- Download the SOAPaligner package above.
- In the Linux console, type:
tar zxvf SOAPaligner.tar.gz
cd SOAPaligner
- In your directory there are 2 executable files, 2bwt-builder and soap.
Command Line Options
To run SOAPaligner, we need to build index files for the reference genome, and then search reads against the formatted index files.
1. Format the reference sequence:
e.g.: ./2bwt-builder ~/human_genome.fa
This creates 13 index files in the same directory. Their names all start with the name of your FASTA file plus “.index”, e.g. human_genome.fa.index. The suffixes are *.amb, *.ann, *.bwt, *.fmv, *.hot, *.lkt, *.pac, *.rev.bwt, *.rev.fmv, *.rev.lkt, *.rev.pac, *.sa, and *.sai.
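Since the alignment step depends on all 13 index files being present, it can help to check for them before starting a long run. The sketch below assumes that each file is named as the FASTA path, then ".index", then one of the suffixes listed above (for example human_genome.fa.index.amb); that naming is inferred from the description here rather than confirmed, and the function names are illustrative.

```python
import os

# Suffixes listed in the text for the files written by 2bwt-builder.
INDEX_SUFFIXES = [
    "amb", "ann", "bwt", "fmv", "hot", "lkt", "pac",
    "rev.bwt", "rev.fmv", "rev.lkt", "rev.pac", "sa", "sai",
]


def expected_index_files(fasta_path):
    """List the index file paths expected for a reference FASTA (assumed naming)."""
    prefix = os.path.expanduser(fasta_path) + ".index"
    return [f"{prefix}.{suffix}" for suffix in INDEX_SUFFIXES]


def missing_index_files(fasta_path):
    """Return the expected index files that do not exist on disk."""
    return [path for path in expected_index_files(fasta_path) if not os.path.exists(path)]
```

An empty list from missing_index_files("~/human_genome.fa") would suggest the index build completed, under the naming assumption stated above.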
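2. Alignment quick start:
For alignment of single-end reads:
./soap -a <reads_a> -D <index.files> -o <output>
For paired-end reads:
./soap -a <reads_a> -b <reads_b> -D <index.files> -o <PE_output> -2 <SE_output> -m <min_insert_size> -x <max_insert_size>
NOTE: The -D option accepts only the prefix of your index files, such as “~/human_genome.fa.index”.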
3. Options:
-D STR Prefix name for the reference index [*.index]
-a STR Query file, for SE reads alignment or one end of PE reads
-b STR Query b file, the other end of PE reads
-o STR Output file for alignment results
-2 STR Output file for reads that are mapped but unpaired in PE alignment
-u STR Output file for unmapped reads, [none]
-m INT Minimal insert size INT allowed for PE, [400]
-x INT Maximal insert size INT allowed for PE, [600]
-n INT Filter out low-quality reads containing more than INT bp Ns, [5]
-t Output read ids instead of read names, [none]
-r INT How to report repeat hits, 0=none; 1=random one; 2=all, [1]
-R RF alignment for PE data with long insert size (>= 2k bps), [none] FR alignment
-l INT For long reads with a high error rate at the 3'-end that cannot be aligned over their whole length, first align the 5' INT bp subsequence as a seed, [256] use the whole length of the read
-v INT Total mismatches allowed in one read, [2]
-M INT Match mode for each read or for the seed part of a read, which should not contain more than 2 mismatches, [4]
    0: exact match only
    1: 1 mismatch match only
    2: 2 mismatch match only
    3: [gap] (coming soon)
    4: find the best hits
-p INT Number of threads, [1]
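When alignments are launched from a script, the options above can be assembled into an argument list programmatically. The following Python sketch covers only a subset of the options (-D, -a, -b, -o, -2, -m, -x, -r, -p) with the defaults given in the list; the function name build_soap_command, its parameter names, and the "./soap" path are assumptions made for illustration.

```python
def build_soap_command(index_prefix, reads_a, output, reads_b=None,
                       unpaired_output=None, min_insert=400, max_insert=600,
                       repeat_mode=1, threads=1, soap_path="./soap"):
    """Build an argument list for soap from a subset of the documented options."""
    if repeat_mode not in (0, 1, 2):
        raise ValueError("-r must be 0 (none), 1 (random one) or 2 (all)")
    if threads < 1:
        raise ValueError("-p needs at least 1 thread")
    if min_insert > max_insert:
        raise ValueError("-m must not exceed -x")

    # -D takes the index prefix (e.g. human_genome.fa.index), not a single file.
    command = [soap_path, "-D", index_prefix, "-a", reads_a, "-o", output]
    if reads_b is not None:
        # Paired-end mode: add the second query file and the insert size range.
        command += ["-b", reads_b, "-m", str(min_insert), "-x", str(max_insert)]
        if unpaired_output is not None:
            command += ["-2", unpaired_output]
    command += ["-r", str(repeat_mode), "-p", str(threads)]
    return command
```

The returned list can be passed to subprocess.run(command, check=True); building a list rather than a single string avoids shell quoting problems with file names.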
Evaluation
SOAPaligner needs about 2 hours to format the reference sequence and build the indexing tables. RAM usage depends on the total size of the reference sequence. For the human reference genome, it occupies about 7 GB of RAM.
Table 1. Performance of aligning 1 million single-end reads (35bp read length) or 1 million read pairs onto the human reference genome
Program | Time (sec), single-end reads | Time (sec), paired-end reads | RAM (GB)
SOAPaligner (soap2) | 120 | 505 | 6.8
soap | 1700 | 5743 | 13.4
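The figures in Table 1 can be turned into speedup factors and throughput with simple arithmetic. The short Python sketch below uses only the times from the table; the dictionary layout and function names are illustrative.

```python
# Times in seconds from Table 1 (1 million reads or 1 million read pairs).
TIMES = {
    "soap2": {"single": 120, "paired": 505},
    "soap": {"single": 1700, "paired": 5743},
}


def speedup(mode):
    """Ratio of soap time to soap2 time for 'single' or 'paired' reads."""
    return TIMES["soap"][mode] / TIMES["soap2"][mode]


def reads_per_second(program, mode, n_reads=1_000_000):
    """Average number of reads (or read pairs) aligned per second."""
    return n_reads / TIMES[program][mode]


for mode in ("single", "paired"):
    print(f"{mode}: {speedup(mode):.1f}x faster, "
          f"{reads_per_second('soap2', mode):.0f} per second with soap2")
```

Both ratios come out above 10, which is consistent with the "one order of magnitude faster" claim in the introduction.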
Future Development
- Binary soap alignment output, and .gz input and output.
Acknowledgements
We thank Prof. T.W. Lam, Alan Tam, Simon Wong, Edward Wu and S.M. Yiu for their outstanding work on 2way-BWT.