lvn3668/paralogIdentification
README for stand-alone BLAST (last updated 08/26/2002)

This document provides information on stand-alone BLAST. Topics covered are setting up stand-alone BLAST, command-line options for stand-alone BLAST, and a release history of the different versions.

BLAST binaries are provided for IRIX6.2, Solaris2.6 (Sparc), Solaris2.7 (Intel), DEC OSF1 (ver. 4.0D), LINUX/Intel, HPUX, Macintosh, and Win32 systems. We will attempt to produce binaries for other platforms upon request.

Stand-alone binaries are available from ftp://ftp.ncbi.nih.gov/blast/executables/

Please remember to FTP in binary mode.

Setting up Standalone BLAST for UNIX:
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

There are three steps needed to set up the Standalone BLAST executable for the UNIX platform.

1) Download the UNIX binary, uncompress and untar the file. It is suggested that you do this in a separate directory, perhaps called "blast".

2) Create a .ncbirc file. In order for Standalone BLAST to operate, you will need a .ncbirc file that contains the following lines:

  [NCBI]
  Data="path/data/"

where "path/data/" is the path to the location of the Standalone BLAST "data" subdirectory. For example:

  Data=/root/blast/data

The data subdirectory should automatically appear in the directory where the downloaded file was extracted. Please note that in many cases it may be necessary to give the entire path, including the machine name and/or the network you are located on. Your systems administrator can help you if you do not know the entire path to the data subdirectory. Make sure that your .ncbirc file is either in the directory that you call the Standalone BLAST program from or in your root directory.

3) Format your BLAST database files. The main advantage of Standalone BLAST is to be able to create your own BLAST databases. This can be done with any file of FASTA formatted protein or nucleotide sequences.
If you are interested in creating your own database files you should refer to the sections "Non-redundant defline syntax" and "Appendix 1: Sequence Identifier Syntax" of the README in the BLAST database directory (ftp://ftp.ncbi.nih.gov/blast/db/). You can also refer to the FASTA description available from the BLAST search pages (http://www.ncbi.nlm.nih.gov/BLAST/fasta.html).

However, for testing purposes you should download one of the NCBI databases and run a search against it. In the BLAST database FTP directory (ftp://ftp.ncbi.nih.gov/blast/db/) you will find the downloadable BLAST database files. For your first search we recommend downloading something relatively small like ecoli.nt.Z (1349 Kb). This is a compressed, FASTA formatted file of nucleotide sequences.

Once the file is uncompressed, you will need to format the database using the 'formatdb' program which comes with your Standalone BLAST executable. The list of arguments for this program and all other BLAST programs is located at the end of the README in the Standalone BLAST FTP directory (ftp://ftp.ncbi.nih.gov/blast/executable/). You can also get these arguments by running each of the BLAST programs (formatdb, blastall etc.) with a single hyphen as the argument (Example: formatdb -). For this document we are just going to show you the basic commands for formatting the database and running your first search.

To format the ecoli.nt database run the following from the command line:

  formatdb -i ecoli.nt -p F -o T

This will create seven index files that Standalone BLAST needs to perform the searches and produce results. The ecoli.nt file is not needed after formatdb has been run and you can delete it.

Next create a test nucleotide file to run against the new database. It may be easier to 'cheat' here and just extract a portion of a nucleotide sequence you know is in the downloaded ecoli.nt database.
Make a text file called test.txt with the following sequence:

>Test
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG
CCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAA
GTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC
AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTG
AAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTT

To run the first search enter the following command from the UNIX command line in your BLAST directory:

  blastall -p blastn -d ecoli.nt -i test.txt -o test.out

This should generate a results file called test.out in the Standalone BLAST directory.

Now you are ready to create your own databases and run BLAST searches. For more information you should refer to the Standalone BLAST README (ftp://ftp.ncbi.nih.gov/blast/executable/) and the BLAST literature. This will give you some idea of all the programs BLAST supports and the use of different parameters for increasing or decreasing the stringency of your results. If you have any questions please send them to the [email protected] e-mail address.

Setting up Standalone BLAST for Windows
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

There are three steps needed to set up the Standalone BLAST executable.

1) Download and uncompress the Standalone BLAST Windows binary blastcz.exe. We suggest doing this in its own directory, perhaps called blast. This is a 'self-extracting' archive: all you need to do is run it, either from a Command Prompt (DOS Prompt) or by selecting "Run" from the Windows "Start" button and browsing to the blastcz.exe file.

2) Create an ncbi.ini file.
In order for Standalone BLAST to operate, you will need an ncbi.ini file that contains the following lines:

  [NCBI]
  Data="C:\path\data\"

where "C:\path\data\" is the path to the location of the Standalone BLAST "data" subdirectory. For example:

  Data=C:\blast\data

This data subdirectory should automatically appear in the directory where the downloaded file was extracted. Make sure that your ncbi.ini file is in the Windows or WINNT directory on your machine.

Note: If you already have an ncbi.ini file on your machine from installing other NCBI software (Network Entrez, Sequin etc.) you can skip this section. However, if you see the following error message, you should rename the old ncbi.ini file to something like ncbi.bak and follow the instructions in step 2 above.

  Abrupt: code=1 FATAL ERROR: FindPath failed.

3) Format your BLAST database files. The main advantage of Standalone BLAST is to be able to create your own BLAST databases. This can be done with any file of FASTA formatted protein or nucleotide sequences. If you are interested in creating your own database you should refer to the sections "Non-redundant defline syntax" and "Appendix 1: Sequence Identifier Syntax" of the README in the BLAST database directory (ftp://ftp.ncbi.nih.gov/blast/db/). You can also refer to the FASTA description available from the BLAST search pages (http://www.ncbi.nlm.nih.gov/BLAST/fasta.html).

However, for testing purposes you should download one of the NCBI databases and run a search against it. In the BLAST database FTP directory (ftp://ftp.ncbi.nih.gov/blast/db/) you will find the downloadable BLAST database files. For your first search we recommend downloading something relatively small like ecoli.nt.Z (1349 Kb). This is a compressed, FASTA formatted file of nucleotide sequences. (If you do not have a copy of UNIX "uncompress" for your Windows PC contact NCBI Info at [email protected]).
Once uncompressed, you will now need to format the database using the 'formatdb' program which comes with your Standalone BLAST executable. The list of arguments for this program and all other BLAST programs are located at the end of the README in the Standalone BLAST FTP directory (ftp://ftp.ncbi.nih.gov/blast/executable/). Or you can get these arguments by running each of the BLAST programs (formatdb, blastall etc.) with a single hyphen as the argument (Example: formatdb -). For this document we are just going to show you the basic commands for formatting the database and running your first search. To format the ecoli.nt database run the following from the command line: formatdb -i ecoli.nt -p F -o T This will create seven index files that Standalone BLAST needs to perform the searches and produce results. The ecoli.nt file can be removed once formatdb has been run. Next create a test nucleotide file to run against the new database. It may be easier to 'cheat' here and just extract a portion of a nucleotide sequence you know is in the downloaded ecoli.nt database. So make a text file called test.txt with the following sequence: >Test AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG CCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAA GTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTG AAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTT To run the first search just do the command: blastall -p blastn -d ecoli.nt -i test.txt -o test.out This should generate a results file called test.out in the Standalone BLAST directory. Now you are ready to create your own databases and run BLAST searches. 
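The first-search walkthrough above can also be scripted. The following Python sketch is only an illustration: the helper names (write_fasta, first_search_commands) are hypothetical and not part of BLAST. It writes a query file in FASTA format and assembles the two command lines shown in this README as argument lists, without running them.

```python
import shlex
import textwrap
from pathlib import Path


def write_fasta(path, header, sequence, width=70):
    """Write one FASTA record, wrapping the sequence at `width` columns."""
    body = "\n".join(textwrap.wrap(sequence, width))
    Path(path).write_text(f">{header}\n{body}\n")


def first_search_commands(db="ecoli.nt", query="test.txt", out="test.out"):
    """Return the formatdb and blastall command lines from this README as argv lists."""
    formatdb = ["formatdb", "-i", db, "-p", "F", "-o", "T"]
    blastall = ["blastall", "-p", "blastn", "-d", db, "-i", query, "-o", out]
    return formatdb, blastall


if __name__ == "__main__":
    # Only the first row of the test sequence is used here as a stand-in;
    # paste the full sequence from the README for a real run.
    write_fasta(
        "test.txt",
        "Test",
        "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC",
    )
    for argv in first_search_commands():
        print(shlex.join(argv))
```

The argument lists can be handed to subprocess.run once the BLAST binaries are on the PATH; that step is left out here because it depends on the local installation.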
For more information you should refer to the Standalone BLAST README (ftp://ftp.ncbi.nih.gov/blast/executable/) and the BLAST literature. This will give you some idea of all the programs BLAST supports and the use of different parameters for increasing or decreasing the stringency of your results. If you have any questions please send them to the [email protected] e-mail address.

SGI Note:
---------

SGI recommends the following threads patches on IRIX6 systems:

  For 6.2 systems, install SG0001404, SG0001645, SG0002000, SG0002420 and SG0002458 (in that order)
  For 6.3 systems, install SG0001645, SG0002420 and SG0002458 (in that order)
  For 6.4 systems, install SG0002194, SG0002420 and SG0002458 (in that order)

These patches can be obtained by calling SGI customer service or from the web: http://support.sgi.com/

System recommendations:
----------------------

BLAST uses memory-mapped files (on UNIX and NT systems), so it runs best if it can read the entire BLAST database into memory, then keep on using it there. The resources consumed reading a database into memory can easily outweigh the cost of a BLAST search, so the memory of a machine is normally more important than its CPU speed. This means that one should have sufficient memory for the largest BLAST database one will use, then run all the searches against this database in serial, then run queries against another database in serial. This guarantees that the database will be read into memory only once.

As of Aug. 1997 the EST FASTA file is about 500 Meg, which translates to about 170-200 Meg of BLAST database. At least another 100-200 Meg should be allowed for memory consumed by the actual BLAST program. All of the FASTA databases together are about 1.5 Gig; the BLAST databases produced from them will probably be about another Gig or so. 4 Gig of disk space, to make room for software and output, is probably a pretty good bet.
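As a rough illustration of the sizing advice above, the Python sketch below scales the 1997 EST figures quoted in this section (about 170-200 Meg of BLAST database per 500 Meg of FASTA, plus 100-200 Meg for the program) to other FASTA sizes, and groups searches so that each database is used in one consecutive run. The function names are hypothetical, and treating the ratio as constant across databases is an assumption, not something this README states.

```python
def estimate_memory_mb(fasta_mb, db_ratio=(170 / 500, 200 / 500), program_mb=(100, 200)):
    """Return a (low, high) memory estimate in Meg for one FASTA file.

    db_ratio and program_mb come from the EST example in this README;
    applying them to other databases is an assumption.
    """
    low = fasta_mb * db_ratio[0] + program_mb[0]
    high = fasta_mb * db_ratio[1] + program_mb[1]
    return low, high


def group_by_database(jobs):
    """Group (query, database) pairs so all searches against one database run in series."""
    grouped = {}
    for query, database in jobs:
        grouped.setdefault(database, []).append(query)
    return grouped


if __name__ == "__main__":
    low, high = estimate_memory_mb(500)
    print(f"EST example: roughly {low:.0f}-{high:.0f} Meg")
    jobs = [("q1", "nr"), ("q2", "est"), ("q3", "nr")]
    for database, queries in group_by_database(jobs).items():
        print(database, queries)
```

Running the searches in the grouped order means each database is read into memory once, as recommended above.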
OSF1 and limit -------------- Some OSF1 users have encountered "out of memory" problems when running searches even though there seems to be plenty of memory on the machine and the search runs well on other platforms. The error message would look something like: [blastall] FATAL ERROR: CoreLib [001.000] gi|509180|emb|X71670.1|MMP17SAR: Failed to allocate 480 bytes Often it is sufficient to simply raise the "datasize" limit, which specifies the maximum allowed heap size. The "datasize" limit can be changed by executing: limit datasize unlimited Note that this change only applies to the current session, so it is advisable to place this command in some file sourced at startup, such as .login or .cshrc. BLAST OPTIONS ------------- Formatdb -------- There is now a separate document describing formatdb (README.formatdb). Please refer to it for information on formatting FASTA files for BLAST searches. Blastall -------- Blastall may be used to perform all five flavors of blast comparison. One may obtain the blastall options by executing 'blastall -' (note the dash). A typical use of blastall would be to perform a blastn search (nucl. vs. nucl.) of a file called QUERY would be: blastall -p blastn -d nr -i QUERY -o out.QUERY The output is placed into the output file out.QUERY and the search is performed against the 'nr' database. If a protein vs. protein search is desired, then 'blastn' should be replaced with 'blastp' etc. Some of the most commonly used blastall options are: blastall arguments: -p Program Name [String] Input should be one of "blastp", "blastn", "blastx", "tblastn", or "tblastx". -d Database [String] default = nr The database specified must first be formatted with formatdb. Multiple database names (bracketed by quotations) will be accepted. An example would be -d "nr est" which will search both the nr and est databases, presenting the results as if one 'virtual' database consisting of all the entries from both were searched. 
The statistics are based on the 'virtual' database of nr and est.

  -i  Query File [File In]
      default = stdin

      The query should be in FASTA format. If multiple FASTA entries are in the input file, all queries will be searched.

  -e  Expectation value (E) [Real]
      default = 10.0

  -o  BLAST report Output File [File Out] Optional
      default = stdout

  -F  Filter query sequence (DUST with blastn, SEG with others) [String]
      default = T

      BLAST 2.0 and 2.1 use the dust low-complexity filter for blastn and seg for the other programs. Both 'dust' and 'seg' are integral parts of the NCBI toolkit and are accessed automatically. If one uses "-F T" then normal filtering by seg or dust (for blastn) occurs (likewise "-F F" means no filtering whatsoever). This option also takes a string as an argument. One may use such a string to change the specific parameters of seg or invoke other filters. Please see the "Filtering Strings" section (below) for details.

  -S  Query strands to search against database (for blast[nx], and tblastx). 3 is both, 1 is top, 2 is bottom [Integer]
      default = 3

  -T  Produce HTML output [T/F]
      default = F

  -l  Restrict search of database to list of GI's [String] Optional

      This option specifies that only a subset of the database should be searched, determined by the list of gi's (i.e., NCBI identifiers) in a file. One can obtain a list of gi's for a given Entrez query from http://www.ncbi.nlm.nih.gov/Entrez/batch.html. This file should be in the same directory as the database, or in the directory that BLAST is called from.

  -U  Use lower case filtering of FASTA sequence [T/F] Optional
      default = F

      This option specifies that any lower-case letters in the input FASTA file should be masked.

Documentation for PSI-TBLASTN

PSI-TBLASTN is a variant of blastall that searches a protein query sequence against a nucleotide sequence database using a position specific matrix created by PSI-BLAST. The nucleotide sequence database is dynamically translated in all reading frames during the PSI-TBLASTN search.
Using a position specific matrix may enable finding more distantly related sequences.

Programs:

  blastpgp [takes a protein query and performs a PSI-BLAST search to create a position specific matrix using a protein database]
  blastall [reads the position specific matrix and performs the PSI-TBLASTN search]

Usage:

A user would typically run blastpgp to create and save a position specific matrix, followed by a run of blastall for the PSI-TBLASTN search. blastpgp must be executed with the -C option followed by a file name in which to save the position specific score matrix. blastall must be executed with the "-p psitblastn" option to run the PSI-TBLASTN search, and with the -R option followed by the name of the file that contains the position specific score matrix.

All other options that apply when using "blastall -p tblastn ..." also apply when using "blastall -p psitblastn ...", but there are some restrictions on the parameters:

1) The query must be the same as the one used in blastpgp for creating the position specific matrix.

2) By default, blastpgp has filtering off (-F F) and blastall has filtering on (-F T). To ensure consistent usage of the blastpgp/psitblastn combination, the -F option should be explicitly set in one or the other run.

Example:

One may run PSI-BLAST to create and save a position specific score matrix as follows:

  blastpgp -d nr -i ff.chd -j 2 -C ff.chd.ckp

The position specific score matrix is saved in ff.chd.ckp. Then, using this matrix, one may run the PSI-TBLASTN search:

  blastall -i ff.chd -d yeast -p psitblastn -R ff.chd.ckp

Note that this allows the score matrix to be constructed using one database (nr in the example) and then used to search a second database (yeast in the example). Even if the two database names are the same, blastpgp uses the protein version while "blastall -p psitblastn" uses the DNA version.

Blastpgp
--------

Blastpgp performs gapped blastp searches and can be used to perform iterative searches in psi-blast and phi-blast mode.
See the PSI-Blast and PHI-BLAST sections (below) for a description of this binary. The options may be obtained by executing 'blastpgp -'.

  -T  Produce HTML output [T/F]
      default = F
  -Q  Output File for PSI-BLAST Matrix in ASCII [File Out] Optional

Bl2seq
------

Bl2seq performs a comparison between two sequences using either the blastn or blastp algorithm. Both sequences must be either nucleotides or proteins. The options may be obtained by executing 'bl2seq -'.

  -i  First sequence [File In]
  -j  Second sequence [File In]
  -p  Program name: blastp, blastn, blastx. For blastx 1st argument should be nucleotide [String]
      default = blastp
  -g  Gapped [T/F]
      default = T
  -o  alignment output file [File Out]
      default = stdout
  -d  theor. db size (zero is real size) [Integer]
      default = 0
  -a  SeqAnnot output file [File Out] Optional
  -G  Cost to open a gap (zero invokes default behavior) [Integer]
      default = 0
  -E  Cost to extend a gap (zero invokes default behavior) [Integer]
      default = 0
  -X  X dropoff value for gapped alignment (in bits) (zero invokes default behavior) [Integer]
      default = 0
  -W  Wordsize (zero invokes default behavior) [Integer]
      default = 0
  -M  Matrix [String]
      default = BLOSUM62
  -q  Penalty for a nucleotide mismatch (blastn only) [Integer]
      default = -3
  -r  Reward for a nucleotide match (blastn only) [Integer]
      default = 1
  -F  Filter query sequence (DUST with blastn, SEG with others) [String]
      default = T
  -e  Expectation value (E) [Real]
      default = 10.0
  -S  Query strands to search against database (blastn only). 3 is both, 1 is top, 2 is bottom [Integer]
      default = 3
  -T  Produce HTML output [T/F]
      default = F

Fastacmd
--------

Fastacmd retrieves FASTA formatted sequences from a BLAST database, if it was formatted using the '-o' option.
An example fastacmd call would be:

  fastacmd -d nr -s p38398

The fastacmd options are:

fastacmd arguments:

  -d  Database [String]
      default = nr
  -s  Search string (GIs, accessions and locuses may be used, delimited by comma or space) [String] Optional
  -i  Input file with GIs/accessions/locuses for batch retrieval [String] Optional
  -a  Retrieve duplicated accessions [T/F] Optional
      default = F
  -l  Line length for sequence [Integer] Optional
      default = 80

Filtering Strings
-----------------

The -F argument can take a string as input specifying that seg should be run with certain values or that other non-standard filters should be used. This section describes the syntax.

The seg options can be changed by using:

  -F "S 10 1.0 1.5"

which specifies a window of 10, locut of 1.0 and hicut of 1.5.

A coiled-coil filter, based on the work of Lupas et al. (Science, vol 252, pp. 1162-4 (1991)) and written by John Kuzio (Wilson et al., J Gen Virol, vol. 76, pp. 2923-32 (1995)), may be invoked by specifying:

  -F "C"

There are three parameters for this: window, cutoff (probability of a coiled coil), and linker (distance between two coiled-coil regions that should be linked together). These are now set to

  window: 22
  cutoff: 40.0
  linker: 32

One may also change the coiled-coil parameters in a manner analogous to that of seg:

  -F "C 28 40.0 32"

will change the window to 28.

One may also run both seg and the coiled-coil filter together by using a ";":

  -F "C;S"

Filtering by dust may also be specified by:

  -F "D"

It is possible to specify that the masking should only be done during the process of building the initial words by starting the filtering command with 'm', e.g.:

  -F "m S"

which specifies that seg (with default arguments) should be used for masking, but that the masking should only be done when the words are being built. This masking option is available with all filters.
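The -F strings shown in this section follow a small, regular syntax, so they can be assembled programmatically. The Python sketch below is a hypothetical helper (filter_string is not part of BLAST) that reproduces only the forms this README shows. Because the README does not show how the 'm' prefix combines with several ";"-separated filters, the helper refuses that case rather than guessing.

```python
def filter_string(filters, mask_lookup_only=False):
    """Build a value for the -F option.

    filters: list of (code, params) pairs, where code is "S" (seg),
             "C" (coiled-coil) or "D" (dust) and params is a tuple of
             numbers (empty tuple for the filter's defaults).
    mask_lookup_only: prefix the string with "m" so masking is applied
             only while the initial words are built.
    """
    parts = []
    for code, params in filters:
        if code not in ("S", "C", "D"):
            raise ValueError(f"unknown filter code: {code!r}")
        parts.append(" ".join([code, *map(str, params)]))
    body = ";".join(parts)
    if mask_lookup_only:
        if len(parts) > 1:
            # Assumption: this README gives no example of "m" with several filters.
            raise ValueError("'m' with several filters is not documented here")
        body = f"m {body}".strip()
    return body


if __name__ == "__main__":
    print(filter_string([("S", (10, 1.0, 1.5))]))           # S 10 1.0 1.5
    print(filter_string([("C", ()), ("S", ())]))            # C;S
    print(filter_string([("S", ())], mask_lookup_only=True))  # m S
```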
If the -U option (to mask any lower-case sequence in the input FASTA file) is used and one does not wish any other filtering, but does wish to mask when building the lookup tables, then one should specify:

  -F "m"

This is the only case where "m" should be specified alone.

PSI-Blast
---------

The blastpgp program can do an iterative search in which sequences found in one round of searching are used to build a score model for the next round of searching. In this usage, the program is called Position-Specific Iterated BLAST, or PSI-BLAST. As explained in the accompanying paper, the BLAST algorithm is not tied to a specific score matrix. Traditionally, it has been implemented using an AxA substitution matrix where A is the alphabet size. PSI-BLAST instead uses a QxA matrix, where Q is the length of the query sequence; at each position the cost of a letter depends on the position w.r.t. the query and the letter in the subject sequence.

The position-specific matrix for round i+1 is built from a constrained multiple alignment among the query and the sequences found with sufficiently low e-value in round i. The top part of the output for each round divides the sequences into two groups: sequences found previously and used in the score model, and sequences not used in the score model. The output currently includes lots of diagnostics requested by users at NCBI. To skip quickly from the output of one round to the next, search for the string "producing", which is part of the header for each round and likely does not appear elsewhere in the output.

PSI-BLAST "converges" and stops if all sequences found at round i+1 below the e-value threshold were already in the model at the beginning of the round.
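The round structure and the convergence rule above can be sketched in a few lines of Python. This is not the blastpgp implementation: search_round is a hypothetical callback standing in for one round of searching, and the model is reduced to the set of sequence identifiers it was built from. The sketch only illustrates when iteration stops.

```python
def iterate_until_converged(search_round, max_rounds, threshold=0.001):
    """Run up to max_rounds rounds and report whether the search converged.

    search_round(model_ids) -> dict mapping sequence id to e-value
        (hypothetical callback for one round of searching).
    threshold: e-value below which a sequence enters the next model.
    Returns (model_ids, rounds_run, converged).
    """
    model = set()
    for round_no in range(1, max_rounds + 1):
        hits = search_round(frozenset(model))
        significant = {sid for sid, evalue in hits.items() if evalue < threshold}
        # Converged: everything below the threshold was already in the
        # model at the beginning of this round.
        if significant <= model:
            return model, round_no, True
        model = significant
    return model, max_rounds, False


if __name__ == "__main__":
    # Scripted results for three rounds, purely for illustration.
    scripted = [
        {"a": 1e-5, "b": 0.5},
        {"a": 1e-6, "b": 1e-4},
        {"a": 1e-6, "b": 1e-5, "c": 2.0},
    ]
    calls = []

    def fake_round(model_ids):
        calls.append(model_ids)
        return scripted[len(calls) - 1]

    print(iterate_until_converged(fake_round, max_rounds=5))
```

In the scripted run, the third round finds no significant sequence outside the model, so the loop stops there even though five rounds were allowed.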
There are several blastpgp parameters specifically for PSI-BLAST:

  -j  is the maximum number of rounds (default 1; i.e., regular BLAST)
  -h  is the e-value threshold for including sequences in the score matrix model (default 0.001)
  -c  is the "constant" used in the pseudocount formula specified in the paper (default 10)

The -C and -R flags provide a "checkpointing" facility whereby a score model can be stored and later reused.

  -C  stores the query and frequency count ratio matrix in a file
  -R  restarts from a file stored previously.

When using -R, it is required that the query specified on the command line match exactly the query in the restart file.

The checkpoint files are stored in a byte-encoded (not human readable) format, so as to prevent roundoff error between writing and reading the checkpoint.

Users who also develop their own sequence analysis software may wish to develop their own scoring systems. For this purpose the code in posit.c that writes out the checkpoint can be easily adapted to write out scoring systems derived by other algorithms in such a way that PSI-BLAST can read the files in later. The checkpoint structure is general in the sense that it can handle any position-specific matrix that fits in the Karlin-Altschul statistical framework for BLAST scoring.

The -B flag provides a way to jump start PSI-BLAST from a master-slave multiple alignment computed outside PSI-BLAST. The multiple alignment must include the query sequence as one of the sequences, but it need not be the first sequence. The multiple alignment must be specified in a format that is derived from Clustal, but without some headers and trailers. See the example below.

The rules are as follows. Suppose the multiple alignment has N sequences. It may be presented in 1 or more blocks, where each block presents a range of columns from the multiple alignment.
E.g., the first block might have columns 1-60, the second block might have columns 61-95, the third block might have columns 96-128. Each block should have N rows, 1 row per sequence. The sequences should be in the same order in every block. Blocks are separated by 1 or more blank lines. Within a block there are no blank lines, and each line consists of 1 sequence identifier followed by some white space followed by characters (and gaps) for that sequence in the multiple alignment. In each column, all letters must be in upper case, or all letters must be in lower case. Upper case means that this column is to be given position-specific scores. Lower-case means to use the underlying matrix (specified by -M) for this column; e.g., if the query sequence has an 'l' residue in the column, then the standard scores for matching an L are used in the column. A sample usage would be: blastpgp -i seq1 -B align1 -j 2 -d nr where seq1 is the query align1 is the alignment file -j 2 indicates to do 2 rounds -d nr indicates to use the nr database The example files seq1 align1 copied below were kindly supplied by L. Aravind from a paper he and Chris Ponting published in Protein Science: Aravind L, Ponting CP, Homologues of 26S proteasome subunits are regulators of transcription and translation, Protein Science 7(1998) 1250-1254. L. Aravind ([email protected]) was the first user and helped define how -B should work. Y. Wolf ([email protected]) helped design a more flexible input format for the alignments. If you like how -B works, let them know. If you do not like how -B works, complain to A. Schaffer([email protected]) who did the implementation. 
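The block rules for -B alignment files described above can be checked mechanically before a file is passed to blastpgp. The Python sketch below is a hypothetical validator, not an NCBI tool. It tests the stated rules (same sequences in the same order in every block, and each column entirely upper case or entirely lower case) and additionally assumes that all rows within a block have the same length, which the text implies but does not state.

```python
def parse_blocks(text):
    """Split an alignment into blocks of (identifier, residues) rows."""
    blocks, current = [], []
    for line in text.splitlines():
        if line.strip():
            parts = line.split(None, 1)
            if len(parts) != 2:
                raise ValueError(f"row without residues: {line!r}")
            current.append((parts[0], parts[1].strip()))
        elif current:
            # One or more blank lines close the current block.
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


def check_alignment(text):
    """Return a list of rule violations (empty if none were found)."""
    blocks = parse_blocks(text)
    if not blocks:
        return ["no alignment rows found"]
    problems = []
    order = [name for name, _ in blocks[0]]
    for b, block in enumerate(blocks, 1):
        if [name for name, _ in block] != order:
            problems.append(f"block {b}: sequence order differs from block 1")
        rows = [residues for _, residues in block]
        if len({len(r) for r in rows}) != 1:
            # Assumption: rows in one block cover the same column range.
            problems.append(f"block {b}: rows have different lengths")
        for col, column in enumerate(zip(*rows), 1):
            letters = [ch for ch in column if ch.isalpha()]
            if any(ch.isupper() for ch in letters) and any(ch.islower() for ch in letters):
                problems.append(f"block {b}, column {col}: mixed upper and lower case")
    return problems


if __name__ == "__main__":
    sample = "q   ACde-\ns1  GCtk-\n\nq   KL\ns1  -M\n"
    print(check_alignment(sample))
```

Gap characters are ignored when checking the case of a column, so a column that contains only gaps and lower-case letters passes.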
seq1 ---- > 26SPS9_Hs IHAAEEKDWKTAYSYFYEAFEGYDSIDSPKAITSLKYMLLCKIMLNTPEDVQALVSGKLALRYAGRQTEA LKCVAQASKNRSLADFEKALTDYRAELRDDPIISTHLAKLYDNLLEQNLIRVIEPFSRVQIEHISSLIKL SKADVERKLSQMILDKKFHGILDQGEGVLIIFDEPP align1 ------ 26SPS9_Hs IHAAEEKDWKTAYSYFYEAFEGYdsidspkaitslkymllckimlntpedvqalvsgklalryagrqtealkcvaqasknr F57B9_Ce LHAADEKDFKTAFSYFYEAFEGYdsvdekvsaltalkymllckvmldlpdevnsllsaklalkyngsdldamkaiaaaaqk YDL097c_Sc ILHCEDKDYKTAFSYFFESFESYhnltthnsyekacqvlkymllskimlnliddvknilnakytketyqsrgidamkavae YMJ5_Ce LYSAEERDYKTSFSYFYEAFEGFasigdkinatsalkymilckimlneteqlagllaakeivayqkspriiairsmadafr FUS6_ARATH KNYIRTRDYCTTTKHIIHMCMNAilvsiemgqfthvtsyvnkaeqnpetlepmvnaklrcasglahlelkkyklaarkfld COS41.8_Ci SLDYKLKTYLTIARLYLEDEDPVqaemyinrasllqnetadeqlqihykvcyarvldyrrkfleaaqrynelsyksaihet 644879 KCYSRARDYCTSAKHVINMCLNVikvsvylqnwshvlsyvskaestpeiaeqrgerdsqtqailtklkcaaglaelaarky YPR108w_Sc IHCLAVRNFKEAAKLLVDSLATFtsieltsyesiatyasvtglftlertdlkskvidspellslisttaalqsissltisl eif-3p110_Hs SKAMKMGDWKTCHSFIINEKMNGkvw------------------------------------------------------- T23D8.4_Ce SKAMLNGDWKKCQDYIVNDKMNQkvw------------------------------------------------------- YD95_Sp IYLMSIRNFSGAADLLLDCMSTFsstellpyydvvryavisgaisldrvdvktkivdspevlavlpqnesmssleacinsl KIAA0107_Hs LYCVAIRDFKQAAELFLDTVSTFtsyelmdyktfvtytvyvsmialerpdlrekvikgaeilevlhslpavrqylfslyec F49C12.8_Hs LYRMSVRDFAGAADLFLEAVPTFgsyelmtyenlilytvitttfaldrpdlrtkvircnevqeqltggglngtlipvreyl Int-6_Mm KFQYECGNYSGAAEYLYFFRVLVpatdrnalsslwgklaseilmqnwdaamedltrlketidnnsvssplqslqqrtwlih 26SPS9_Hs sladfekaltdy----------------------------------------------------------------------------------- F57B9_Ce rslkdfqvafgsf---------------------------------------------------------------------------------- YDL097c_Sc aynnrslldfntalkqy------------------------------------------------------------------------------ YMJ5_Ce krslkdfvkalaeh--------------------------------------------------------------------------------- FUS6_ARATH 
vnpelgnsyneviapqdiatygglcalasfdrselkqkvidninfrnflelvpdvrelindfyssryascleylasl------------------ COS41.8_Ci eqtkalekalncailapagqqrsrmlatlfkdercqllpsfgilekmfldriiksdemeefar-------------------------------- 644879 kqaakclllasfdhcdfpellspsnvaiygglcalatfdrqelqrnvissssfklflelepqvrdiifkfyeskyasclkmldem---------- YPR108w_Sc yasdyasyfpyllety------------------------------------------------------------------------------- eif-3p110_Hs ----------------------------------------------------------------------------------------------- T23D8.4_Ce ----------------------------------------------------------------------------------------------- YD95_Sp ylcdysgffrtladve------------------------------------------------------------------------------- KIAA0107_Hs rysvffqslavv----------------------------------------------------------------------------------- F49C12.8_Hs esyydchydrffiqlaale---------------------------------------------------------------------------- Int-6_Mm wslfvffnhpkgrdniidlflyqpqylnaiqtmcphilrylttavitnkdvrkrrqvlkdlvkviqqesytykdpitefveclyvnfdfdgaqkk 26SPS9_Hs ----RAELRDDPIISTHLAKLYDNLLEQNLIRVIEPFSRVQIEHISSLIKLSKADVERKLSQMILDKKFHGILDQGEGVLIIFDEPP F57B9_Ce ----PQELQMDPVVRKHFHSLSERMLEKDLCRIIEPYSFVQIEHVAQQIGIDRSKVEKKLSQMILDQKLSGSLDQGEGMLIVFEIAV YDL097c_Sc ----EKELMGDELTRSHFNALYDTLLESNLCKIIEPFECVEISHISKIIGLDTQQVEGKLSQMILDKIFYGVLDQGNGWLYVYETPN YMJ5_Ce ----KIELVEDKVVAVHSQNLERNMLEKEISRVIEPYSEIELSYIARVIGMTVPPVERAIARMILDKKLMGSIDQHGDTVVVYPKAD FUS6_ARATH ----KSNLLLDIHLHDHVDTLYDQIRKKALIQYTLPFVSVDLSRMADAFKTSVSGLEKELEALITDNQIQARIDSHNKILYARHADQ COS41.8_Ci ----QLMPHQKAITADGSNILHRAVTEHNLLSASKLYNNIRFTELGALLEIPHQMAEKVASQMICESRMKGHIDQIDGIVFFERRET 644879 ----KDNLLLDMYLAPHVRTLYTQIRNRALIQYFSPYVSADMHRMAAAFNTTVAALEDELTQLILEGLISARVDSHSKILYARDVDQ YPR108w_Sc ----ANVLIPCKYLNRHADFFVREMRRKVYAQLLESYKTLSLKSMASAFGVSVAFLDNDLGKFIPNKQLNCVIDRVNGIVETNRPDN eif-3p110_Hs ----DLFPEADKVRTMLVRKIQEESLRTYLFTYSSVYDSISMETLSDMFELDLPTVHSIISKMIINEELMASLDQPTQTVVMHRTEP T23D8.4_Ce 
----NLFHNAETVKGMVVRRIQEESLRTYLLTYSTVYATVSLKKLADLFELSKKDVHSIISKMIIQEELSATLDEPTDCLIMHRVEP YD95_Sp ----VNHLKCDQFLVAHYRYYVREMRRRAYAQLLESYRALSIDSMAASFGVSVDYIDRDLASFIPDNKLNCVIDRVNGVVFTNRPDE KIAA0107_Hs ----EQEMKKDWLFAPHYRYYVREMRIHAYSQLLESYRSLTLGYMAEAFGVGVEFIDQELSRFIAAGRLHCKIDKVNEIVETNRPDS F49C12.8_Hs ----SERFKFDRYLSPHFNYYSRGMRHRAYEQFLTPYKTVRIDMMAKDFGVSRAFIDRELHRLIATGQLQCRIDAVNGVIEVNHRDS Int-6_Mm lrecESVLVNDFFLVACLEDFIENARLFIFETFCRIHQCISINMLADKLNMTPEEAERWIVNLIRNARLDAKIDSKLGHVVMGNNAV PHI-Blast --------- PHI-BLAST (Pattern-Hit Initiated BLAST) is a search program that combines matching of regular expressions with local alignments surrounding the match. The most important features of the program have been incorporated into the BLAST software framework partly for user convenience and partly so that PHI-BLAST may be combined seamlessly with PSI-BLAST. Other features that do not fit into the BLAST framework will be released later as a separate program and/or separate Web page query options. One very restrictive way to identify protein motifs is by regular expressions that must contain each instance of the motif. The PROSITE database is a compilation of restricted regular expressions that describe protein motifs. Given a protein sequence S and a regular expression pattern P occurring in S, PHI-BLAST helps answer the question: What other protein sequences both contain an occurrence of P and are homologous to S in the vicinity of the pattern occurrences? PHI-BLAST may be preferable to just searching for pattern occurrences because it filters out those cases where the pattern occurrence is probably random and not indicative of homology. PHI-BLAST may be preferable to other flavors of BLAST because it is faster and because it allows the user to express a rigid pattern occurrence requirement. The pattern search methods in PHI-BLAST are based on the algorithms in: R. Baeza-Yates and G. Gonnet, Communications of the ACM 35(1992), pp. 74-82. S. Wu and U. 
Manber, Communications of the ACM 35(1992), pp. 83-91.

The calculation of local alignments is done using a method very similar to (and much of the same code as) gapped BLAST. However, the method of evaluating statistical significance is different, and is described below.

In the stand-alone mode the typical PHI-BLAST usage looks like:

blastpgp -i -k -p patseedp

where -i is followed by the file containing the query in FASTA format
where -k is followed by the file containing the pattern in a syntax given below
and "patseedp" indicates the mode of usage, not representing any file.

The syntax for the query sequence is FASTA format as for all other BLAST queries. The syntax for patterns follows the rules of PROSITE and is documented in detail below. The specified pattern is not required to be in the PROSITE list.

Most of the other BLAST flags can be used with PHI-BLAST. One important exception is that PHI-BLAST requires gapped alignments (i.e. forbids -g F in the flags) because ungapped alignments do not make sense for almost all patterns in PROSITE.

There is a second mode of PHI-BLAST usage that is important when the specified pattern occurs more than once in the query. In this case, the user may be interested in restricting the search for local alignments to a subset of the pattern occurrences. This can be done with a search that looks like:

blastpgp -i -k -p seedp

in which case the use of the "seedp" option requires the user to specify the location(s) of the interesting pattern occurrence(s) in the pattern file. The syntax for how to specify pattern occurrences is below. When there are multiple pattern occurrences in the query it may be important to decide how many are of interest because the E-value for matches is effectively multiplied by the number of interesting pattern occurrences.

The PHI-BLAST Web page supports only the "patseedp" option.

PHI-BLAST is integrated with PSI-BLAST.
In the command-line mode, PSI-BLAST can be invoked by using the -j option, as usual. When this is done as:

blastpgp -i -k -p patseedp -j

then the first round of searching uses PHI-BLAST and all subsequent rounds use PSI-BLAST. In the Web page setting, the user must explicitly invoke one round at a time, and the PHI-BLAST Web page provides the option to initiate a PSI-BLAST round with the PHI-BLAST results. To describe a combined usage, use the term "PHI-PSI-BLAST" (Pattern-Hit Initiated, Position-Specific Iterated BLAST).

Determining statistical significance.

When a query sequence Q matches a database sequence D in PHI-BLAST, it is useful to subdivide Q and D into 3 disjoint pieces:

Qleft Qpattern Qright
Dleft Dpattern Dright

The substrings Qpattern and Dpattern contain the pattern specified in the pattern file. The pieces Qpattern and Dpattern are aligned and that alignment is displayed as part of the PHI-BLAST output, but the score for that alignment is mostly ignored. The "reduced" score r of an alignment is the sum of the scores obtained by aligning Qleft with Dleft and by aligning Qright with Dright. The expected number of alignments with a reduced score >= x is given by:

CN(Lambda*x + 1)e^(-Lambda*x)

where:
C and Lambda are "constants" depending on the score matrix and the gap costs.
N is (number of occurrences of pattern in database) * (number of occurrences of pattern in Q)
e is the base of the natural logarithm.

It is important to understand that this method of computing the statistical significance of a PHI-BLAST alignment is mathematically different from the method used for BLAST and PSI-BLAST alignments. However, both methods provide E-values, so the E-values are displayed with a similar output syntax.

Rules for pattern syntax for PHI-BLAST.

The syntax for patterns in PHI-BLAST follows the conventions of PROSITE. When using the stand-alone program, it is permissible to have multiple patterns in a file separated by a blank line between patterns.
When using the Web-page only one pattern is allowed per query.

Valid protein characters for PHI-BLAST patterns: ABCDEFGHIKLMNPQRSTVWXYZU
Valid DNA characters for PHI-BLAST patterns: ACGT

Other useful delimiters:
[ ] means any one of the characters enclosed in the brackets, e.g., [LFYT] means one occurrence of L or F or Y or T
- means nothing (this is a spacer character used by PROSITE)
x with nothing following means any residue
x(5) means 5 positions in which any residue is allowed (and similarly for any other single number in parentheses after x)
x(2,4) means 2 to 4 positions where any residue is allowed, and similarly for any other two numbers separated by a comma; the first number should be < the second number.
> can occur only at the end of a pattern and means nothing; it may occur before a period (another spacer used by PROSITE)
. may be used at the end of the pattern and means nothing

When using the stand-alone program, the pattern should be in a file, with the first line starting: ID followed by 2 spaces and a text string giving the pattern a name. There should also be a line starting PA followed by 2 spaces followed by the pattern description. All other PROSITE codes in the first two columns are allowed, but only the HI code, described below, is relevant to PHI-BLAST.

Here is an example from PROSITE.

ID  CNMP_BINDING_2; PATTERN.
AC  PS00889;
DT  OCT-1993 (CREATED); OCT-1993 (DATA UPDATE); NOV-1995 (INFO UPDATE).
DE  Cyclic nucleotide-binding domain signature 2.
PA  [LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV].
NR  /RELEASE=32,49340;
NR  /TOTAL=57(36); /POSITIVE=57(36); /UNKNOWN=0(0); /FALSE_POS=0(0);
NR  /FALSE_NEG=1; /PARTIAL=1;
CC  /TAXO-RANGE=??EP?; /MAX-REPEAT=2;

The line starting ID gives the pattern a name. The lines starting AC, DT, DE, NR, NR, CC are relevant to PROSITE users, but irrelevant to PHI-BLAST. These lines are tolerated, but ignored by PHI-BLAST.
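
As a rough illustration of the pattern syntax rules and the PA line above, the following Python sketch translates a pattern into an ordinary regular expression and lists where it occurs in a sequence. It is not how PHI-BLAST itself searches (PHI-BLAST uses the bit-parallel algorithms cited earlier); Python's standard re module is substituted here for convenience. The function names prosite_to_regex and find_occurrences, the test sequence, and the choice of 1-based inclusive positions are assumptions made for this example only.

```python
import re

# One pattern element: 'x', a single residue, or a bracketed set,
# optionally followed by (n) or (n,m). Hypothetical helper for this sketch.
_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    # Trailing '>' and '.' characters mean nothing, so drop them.
    core = pattern.strip().rstrip(">.")
    parts = []
    for element in core.split("-"):  # '-' is only a spacer
        if not element:
            continue
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError("unsupported pattern element: %r" % element)
        symbol, low, high = match.groups()
        atom = "." if symbol == "x" else symbol  # x means any residue
        if low is None:
            parts.append(atom)
        elif high is None:
            parts.append("%s{%s}" % (atom, low))
        else:
            parts.append("%s{%s,%s}" % (atom, low, high))
    return "".join(parts)


def find_occurrences(pattern, sequence):
    """Return (start, end) positions of non-overlapping pattern matches.

    Positions are 1-based and inclusive; that numbering is an assumption
    of this sketch, not something taken from the PHI-BLAST source.
    """
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.end()) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    cnmp = "[LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV]."
    print(prosite_to_regex(cnmp))
    # Made-up sequence fragment, used only to exercise the matcher.
    print(find_occurrences("[KRHQSA]-[DENQ]-E-L>.", "AAKDELAA"))
```

Note that re.finditer reports only non-overlapping matches, so a production tool would need extra care where occurrences of a pattern can overlap.
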
The line starting PA describes the pattern as:
one of LIVMF followed by
G followed by
E followed by
any single character followed by
one of GAS followed by
one of LIVM followed by
any 5 to 11 characters followed by
R followed by
one of STAQ followed by
A followed by
any single character followed by
one of LIVMA followed by
any single character followed by
one of STACV

In this case the pattern ends with a period. It can end with nothing after the last specifying symbol or any number of > signs or periods or combination thereof.

Here is another example, illustrating the use of an HI line.

ID  ER_TARGET; PATTERN.
PA  [KRHQSA]-[DENQ]-E-L>.
HI  (19 22)
HI  (201 204)

In this example, the HI lines specify that the pattern occurs twice, once from positions 19 through 22 in the sequence and once from positions 201 through 204 in the sequence. These specifications are relevant when stand-alone PHI-BLAST is used with the seedp option, in which the interesting occurrences of the pattern in the sequence are specified. In this case the HI lines specify which occurrence(s) of the pattern should be used to find good alignments.

In general, the seedp option is more useful than the standard patseedp option ONLY when the pattern occurs K > 1 times in the sequence AND the user is interested in matching to J < K of those occurrences. Then using the HI lines enables the user to specify which occurrences are of interest.

Additional functionality related to PHI-BLAST.

PHI-BLAST takes as input both a query sequence and a pattern occurring in that sequence, and searches a sequence database for other sequences containing the same pattern and having a good alignment. One may be interested in asking two related, simpler questions:

1. Given a sequence and a database of patterns, which patterns occur in the sequence and where?
2. Given a pattern and a sequence database, which sequences contain the pattern and where?
These queries can be answered with software closely related to PHI-BLAST, but they do not fit into the output framework of BLAST because the answers are simple lists without alignments and with no notion of statistical significance. The NCBI toolbox includes another program, currently called seedtop, to answer the two queries above.

Query 1 can be asked with:
seedtop -i -k -p patmatchp
Query 2 can be asked with:
seedtop -d -k -p patternp

The -k argument is used similarly in all queries and the file format is always the same. The standard pattern database is PROSITE, but others (or a subset) can be used. There are plans afoot to offer the patmatchp query (number 1) on the PHI-BLAST web page or in its vicinity, but this would be restricted to having PROSITE as the pattern database.

References

Zhang, Zheng, Alejandro A. Schäffer, Webb Miller, Thomas L. Madden, David J. Lipman, Eugene V. Koonin, and Stephen F. Altschul (1998), "Protein sequence similarity searches using patterns as seeds", Nucleic Acids Res. 26:3986-3990.

Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.

Karlin, Samuel and Stephen F. Altschul (1990), "Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes", Proc. Natl. Acad. Sci. USA 87:2264-68.

Karlin, Samuel and Stephen F. Altschul (1993), "Applications and statistics for multiple high-scoring segments in molecular sequences", Proc. Natl. Acad. Sci. USA 90:5873-7.

Schäffer, Alejandro A., L. Aravind, Thomas L. Madden, Sergei Shavirin, John L. Spouge, Yuri I. Wolf, Eugene V. Koonin, and Stephen F. Altschul (2001), "Improving PSI-BLAST Protein Database Search Sensitivity with Composition-Based Statistics and Other Refinements", Nucleic Acids Res. 29:2994-3005.
Release History
---------------

Notes for 2.2.4 release (08/26/02):

Enhancements:

1.) Discontiguous word matching is now available for megablast. See http://www.ncbi.nlm.nih.gov/blast/discontiguous.html for details.
2.) An out-of-frame gapping option (meaning that one or two bases can be inserted or deleted from an alignment) is now available in blastall for blastx and tblastn. NOTE that the expect values have been calculated assuming in-frame gapping (three bases inserted/deleted) and should only be used for guidance.
3.) Fastacmd can now dump out partial sequences (using the -L option) and print taxonomic information for a sequence.

Bug fixes:

1.) A problem that caused blastall to core-dump when the -U option (mask the sequence that is lower-case in input file) was used has been fixed.
2.) A problem that caused bl2seq to not work properly for protein-protein searches with BLOSUM62 (on some platforms) has been fixed.
3.) A problem that caused seedtop to core dump if there were a lot of hits has been fixed.
4.) Using -n with blastall (megablast mode) now returns the same results as default megablast.
5.) XML output for megablast has been fixed.
6.) A problem with translating rpsblast that caused it to crash on OSF/1 and report incorrect values on other platforms has been fixed.
7.) A memory leak in formatdb was fixed.
8.) A problem that caused blastpgp to core-dump when running in PHI-BLAST mode (if many hits were found) was fixed. Memory leaks were also fixed.
9.) The double closing of a file that caused phi-blast to crash occasionally under LINUX has been fixed.

Notes for 2.2.3 release (04/24/02):

Enhancements:

1.) Version 4 of the BLAST databases is now the default for formatdb. This can be overridden for older binaries by use of "-A F" on the command-line.

Bug fixes:

1.) A problem has been fixed that caused tblastn searches to miss some protein matches, if the database sequence was longer than 15 million bases.
2.)
Selenocysteine residues (U) in the query are now replaced by X's, as these are not supported in the currently available matrices (e.g., BLOSUM62) and their presence occasionally caused data corruption.
3.) A problem with combining the "-m 7" and "-n T" options in blastall has been fixed.
4.) XML output had a <Hit_def> field that could (incorrectly) have an empty value; this has been fixed.
5.) A problem with reading databases with more than one volume and an oidlist has been fixed.
6.) A problem with ungapped XML output that caused all HSP's to be numbered zero has been resolved; they are now numbered with one-offset.
7.) A bug that prevented use of some matrices for ungapped searches has been fixed.
8.) Effective query and database lengths were calculated incorrectly for rpsblast, leading to a minor change in expect values in some cases. This has been corrected.
9.) A for loop that could overrun the end of a buffer during formatting was fixed. Many thanks to Haruna Cofer of SGI for pointing this out.
10.) The effective database length command-line argument (-z) has been fixed for blastall and megablast. The parser was reading digits only up to the first non-digit (e.g., 1.6e8 was interpreted as "1"), leading to wildly incorrect effective database lengths. This has been fixed so that 160000000 and 1.6e8 are interpreted the same way.

Notes for 2.2.2 release:

Enhancements:

1.) Version 4 of the BLAST databases is now fully supported. This version has some enhancements described in README.formatdb and fixes some problems described below. Use the "-A" option on formatdb to produce the new database version. The BLAST binaries for release 2.2.2 are entirely compatible with both the current and the new version of the BLAST databases. Old BLAST binaries are not necessarily compatible with the new database format.
2.) Fastacmd will dump out an entire BLAST database in FASTA format if the new -D option is used.
3.)
Fastacmd will separate definition lines from different GI's that have been merged together in nr (as they all have the same sequence) by control-A's if the new -c option is used.

Bug fixes:

1.) A problem has been fixed that caused tblastn searches to miss some protein matches, if the database sequence was longer than 15 million bases.
2.) The old (current) version of the BLAST databases has a "rollover" problem if the total number of bases in a single volume is greater than 4294967295. The new database version (#4) allows eight bytes for this.
3.) The old (current) version of the BLAST database format does not handle ambiguity characters in a nucleotide database sequence if it is over 16 million characters long. The new version of the BLAST database does.
4.) A performance problem that caused mutexes to be acquired too often for multi-threaded runs with four or more CPU's has been fixed. Thanks to Haruna Cofer of SGI for help in finding the cause.
5.) A problem that caused ungapped blastp/blastx/tblastn/tblastx to crash on certain matrices (e.g., pam10) has been fixed.
6.) Some blastpgp problems with using the -B (for reading a master-slave alignment) and reading checkpoint files (-C) have been resolved.

Notes for 2.2.1 release:

Enhancements:

1.) BLAST and PSI-BLAST improvements as described in Schäffer et al., Nucleic Acids Research 2001 Jul 15;29(14):2994-3005. These include improvements to the use of composition-based statistics and improvements to the edge-correction effects. Composition-based statistics were initially implemented in release 2.1.1, but the implementation is improved in release 2.2.1.
2.) Formatdb automatically produces database volumes for input consisting of more than 4 billion letters.
3.) Formatdb can produce an alias file for a given database and GI list as well as convert a GI list to the more efficient binary format. See details in README.formatdb.
4.) RPSBLAST now works properly with 'scaled' databases.
The scaling factor must be set when executing the program 'makemat' (which takes PSI-BLAST checkpoints as input). Scaling-up the matrix improves the precision of the (integer) calculations.
5.) Tabular output has now been added to blastpgp and rpsblast; use the "-m 8" option.
6.) Blastpgp will now process multiple queries.

Bug fixes:

1.) A problem with the -K option (for culling) that caused BLAST to crash has been fixed.
2.) A problem with the "gnl" identifier and multi-volume databases has been fixed.
3.) A problem that caused BLASTN to very rarely find suboptimal alignments has been fixed.
4.) A problem that could cause makemat to crash has been fixed.
5.) Some multi-threading problems pointed out by Henry Gabb of KAI were fixed.
6.) Some PC-lint errors and warnings pointed out by Russ Williams of United Devices were fixed.

Notes for 2.1.3 release:

Enhancements:

1.) Addition of PSI-TBLASTN ability to blastall; see description in README.bls.
2.) Database sequences over 5 million bases in length are now broken into chunks to keep memory usage reasonable.
3.) Blastall now allows one to enter a location if it is desired to search a subsequence of the query.
4.) Formatdb can produce a new BLAST database format using the -A option. The BLAST programs can read this format as well as the current format (the program automatically identifies which version it should work with). This new format stores the sequence definition lines in a structured manner (as ASN.1); this will allow future versions of BLAST to better present taxonomic information as well as information about other resources (e.g., UniGene, LocusLink) for a database sequence.
5.) Blastall can now produce tab-delimited output; use "-m 8" to specify this.
6.) Improved Karlin-Altschul parameters are now being used; they were calculated using the "island" method.
7.)
A "gapped" check was added to BLASTN to ensure that if a hit is low-scoring after an ungapped extension, but high-scoring after a gapped extension, it will not be missed.
8.) The formatdb error messages have been improved for the case of illegal characters in the sequence.
9.) The number of HSP's saved in an ungapped search has been increased to 400 from 200.

Bug fixes:

1.) A problem with XML output was fixed.
2.) A problem with the seg filtering under LINUX was fixed (many thanks to Eric Cabot at GCG for pointing this out).
3.) A problem with the format of BLAST reports when the "-o" flag was not used to produce the database was fixed (thanks again to Eric Cabot).
4.) A problem with reading the BLAST database caused by a 4-byte signed integer that should have been unsigned was fixed (thanks to Haruna Cofer at SGI for pointing this out).
5.) A problem with copymat under NT and IRIX was fixed.

Notes for 2.1.2 release:

Enhancements:

1.) Release of rpsblast. Rpsblast performs a search against a database of profiles. See README.rps for full details.
2.) Release of blastclust. BLASTCLUST automatically and systematically clusters protein sequences based on pairwise matches found using the BLAST algorithm. See README.bcl for full details.
3.) Release of megablast. Megablast uses the greedy algorithm of Webb Miller et al. for nucleotide sequence alignment search and concatenates many queries to save time spent scanning the database. See README.mbl for full details.
4.) XML output can now be produced. Use the '-m 7' option for this. The XML output is still experimental.
5.) The default behavior of the culling (-K) option has been changed. Previously this option was set to 100, meaning that if more than 100 HSP's had a hit to a region, lower scoring ones would be dropped. The option is now zero, which turns off this behavior. In a few cases this change will result in more database sequences being reported.
The previous behavior can be recovered by using '-K 100' on the command-line.

Bug fixes:

1.) A bug that caused only the last SeqAnnot to be written (if the -O option was used) when multiple sequences were searched has been fixed. All SeqAnnots are printed out.
2.) A bug that caused the search space (set on the command line with the -Y option) to be ignored for some blastx and tblastn calculations has been fixed.
3.) A failure to close a file if a gilist was used (using the -l option) was fixed. Many thanks to David Mathog at CalTech for spotting this problem and suggesting a fix.
4.) A bug that caused all the database names listed in an alias file to be printed, rather than the "TITLE" field, has been fixed.

Notes for 2.1.1:

Enhancements:

1.) Addition of composition-based statistics: BLAST and PSI-BLAST now permit calculated E-values to take into account the amino acid composition of the individual database sequences involved in reported alignments. This improves E-value accuracy, thereby reducing the number of false positive results.

The improved statistics are achieved with a scaling procedure [1,2] which in effect employs a slightly different scoring system for each database sequence. As a result, raw BLAST alignment scores in general will not correspond precisely to those implied by any standard substitution matrix. Furthermore, identical alignments can receive different scores, based upon the compositions of the sequences they involve.

The improved statistics are now used by default for all rounds of searching on the PSI-BLAST page, but not on the BLAST page. Therefore, if one uses default settings, the results of the first round of searching will be different on the BLAST and PSI-BLAST pages.

In addition, adjustments have been made to two PSI-BLAST parameters: the pseudocount constant default has been changed from 10 to 7, and the E-value threshold for including matches in the PSI-BLAST model has been changed from 0.001 to 0.002.

1. Altschul, S.F. et al.
(1997) Nucl. Acids Res. 25:3389-3402.
2. Schäffer, A.A. et al. (1999) Bioinformatics 15:1000-1011.

Notes for 2.0.14 release:

Bug fixes:

1.) extra line returns between sequences in a FASTA file cause formatdb to produce corrupted databases.
2.) ";" at the beginning of a line was not being treated as a comment.
3.) a problem with the formatter causes blast to core-dump if the FASTA definition line only contains an identifier and no description.
4.) a problem in the ungapped extension for protein sequences causes a rare problem.
5.) the '-U' option that causes lower-case sequence to be masked does not work correctly for blastx.

Notes for 2.0.13 release:

Enhancements:

1.) The output format for pairwise alignments was changed to put each new gi (if the sequence has redundant gi's) on a new line. If HTML output is specified then each gi is hyperlinked.

Bug fixes:

1.) An NCBI toolkit problem parsing the new RefSeq format in FASTA files (two bars instead of three) was fixed. This fix applies to all BLAST binaries (formatdb, blastall, blastpgp, etc.).
2.) A problem that caused BLAST version 2.0.12 under NT to freeze in multithreaded mode has been fixed.

Notes for 2.0.12 release:

Enhancements:

1.) Bl2seq can now perform nucleotide-protein (blastx style) comparisons. This necessitated changing the '-p' option from a Boolean to a string. Valid arguments are "blastn", "blastp", or "blastx".

Bug fixes:

1.) A problem in the NCBI threads library that caused BLAST to sometimes stick was corrected. Many thanks to Haruna Cofer and colleagues at SGI for providing a fix.
2.) A problem that caused BLAST to core-dump (especially on long queries) has been fixed. Many thanks to Gary Williams for providing examples.
3.) A problem that prevented the search of multiple multivolume databases has been fixed.

Notes for 2.0.11 release:

Enhancements:

1.) Optimizations were contributed by Chris Joerg of COMPAQ.
These changes reduce the number of cache misses, unroll loops, and make some instructions unnecessary. These improvements can speed up BLAST for long sequences several-fold.
2.) A database is now only memory-mapped while being searched. If multiple databases are searched and the total exceeds the allowed memory-map limit this allows all databases to be searched as memory-mapped files. If a database cannot be memory-mapped it is read as an ordinary file, rather than causing an error.

Bug fixes:

1.) Formatdb was fixed to correct a problem with FASTA string identifiers under NT.
2.) Blastpgp was fixed to prevent a core-dump under LINUX.
3.) BLASTN was found to miss some hits near the expect value cutoff. This has been corrected.

Notes for 2.0.10 release:

Enhancements:

1.) Bl2seq, a utility to compare two sequences using the blastn or blastp approach, is included in the archive. See the full description in the README.bls for details.
2.) A 'sparse' option ('-s') has been added to formatdb. This option limits the indices for the string identifiers (used by formatdb) to accessions (i.e., no locus names). This is especially useful for sequence sets like the EST's where the accession and locus names are identical. Formatdb runs faster and produces smaller temporary files if this option is used. It is strongly recommended for EST's, STS's, GSS's, and HTGS's.
3.) A volume option ('-v') has been added to formatdb. This option breaks up large FASTA files into 'volumes' (each with a maximum size of 2 billion letters). As part of the creation of a volume, formatdb writes a new type of BLAST database file, called an alias file, with the extension 'nal' or 'pal'. This option should be used if one wishes to formatdb large databases (e.g., over 2 billion base pairs).
4.) It is now possible to jump start the command line version of PSI-BLAST (blastpgp) from a multiple alignment that includes the query sequence using the -B option. Details are in README.bls.
5.)
The maximum wordsize limit for BLASTN has been removed.

Bug fixes:

1.) A problem that occurred if the database length, set by the '-z' option, was greater than 2 billion was fixed.
2.) A core-dump that resulted from the use of the coiled-coil masking ('-F C') was fixed by including a file needed for the data directory.
3.) A bug was fixed that caused some very short alignments to be assigned incorrect expect values.
4.) A bug was fixed that caused formatdb to produce incorrect BLAST databases if the input was ASN.1.
5.) A serious performance problem with BLASTN and longer words (greater than 16) was fixed.

Notes for 2.0.9 release:

Enhancements:

1.) two new options have been added to blastall: to produce output in HTML and to search a subset of the database based upon a list of GI's. Please see the options section for full information.
2.) two new options have been added to blastpgp: to produce HTML output and to produce an ASCII version of the PSI-BLAST Matrix. Please see the options section for more information.
3.) formatdb has a new option to allow specification of a 'base' name. See the options section for full details.
4.) it is possible to mask only during the phase when the lookup table is being built, but not during the extensions. See the options section for full details.

Bug fixes:

1.) a problem that occurred when too many HSP's aligned to the same part of the query from one database sequence has been fixed.
2.) a problem that caused seedtop to not perform pattern-matching for DNA sequences has been fixed.
3.) the number of HSP's saved for ungapped BLAST and tblastx is now limited to 200 to prevent problems with memory and speed.
4.) a missing thread join that caused problems under DEC Alpha has been added.
5.) a formatting problem with the database summary at the beginning of the BLAST output (if multiple databases totaling over 2 Gig) has been fixed.
6.)
a bug in formatdb that caused a core-dump if the total number of sequences was an exact multiple of 100000 was fixed.

Notes for 2.0.8 release:

Enhancements:

1.) Frame and strand information was added to the output. Examples of the new output format may be found at http://www.ncbi.nlm.nih.gov/BLAST/example.html.
2.) An option that specifies the query strand to be searched (for blastn, blastx, and tblastx) has been added. The option is '-S'.

Bug fixes:

1.) The problem with the 'too-wide' parameter input screen under NT was fixed.
2.) BLAST no longer core-dumps when the query is NULL.
3.) BLAST no longer core-dumps when the query contains an '@' and blastx or tblastx is selected.

Notes for 2.0.7 release:

Bug fixes:

1.) BLAST now multi-threads properly under LINUX.
2.) A problem with very redundant databases and psi-blast was fixed.
3.) A problem with the formatting of the number of identities and positives was fixed. This affected results on the minus strand only and did not affect the expect value or scores.
4.) A problem that caused tblastn to core-dump very occasionally was corrected.
5.) A problem with multiple patterns in PHI-BLAST was fixed.
6.) A limit on the number of HSP's that were saved (100) was removed.

Notes for 2.0.6 release:

Enhancements:

1.) PHI-BLAST is included in this release. Please see notes on PHI-BLAST for details.
2.) SEG has become an integral part of the NCBI toolkit and it is no longer necessary to install it separately. It is also now supported under non-UNIX platforms.
3.) Access to filtering options. If one uses "-F T" then normal filtering by seg or dust (for blastn) occurs (likewise "-F F" means no filtering whatsoever). The seg options can be changed by using:

-F "S 10 1.0 1.5"

which specifies a window of 10, locut of 1.0 and hicut of 1.5.
One may also specify coiled-coil filtering by specifying:

-F "C"

There are three parameters for this: window, cutoff (probability of a coiled-coil), and linker (distance between two coiled-coil regions that should be linked together). These are now set to

window: 22
cutoff: 40.0
linker: 32

One may also change the coiled-coil parameters in a manner analogous to that of seg:

-F "C 28 40.0 32"

will change the window to 28. One may also run both seg and coiled-coil filtering together by using a ";":

-F "C;S"

4.) BLAST has been changed to reduce the number of redundant hits that a user may see. This is achieved by keeping track of the number of hits completely contained in a certain region and eliminating those lower scoring hits that are redundant with others. This behavior may be controlled with the -K and -L options:

-K Number of best hits from a region to keep [Integer] default = 50
-L Length of region used to judge hits [Integer] default = 20

Setting -K to zero turns off this feature. This is the default only on blastall.

Bug fixes:

1.) There was a problem with the procedure that called the external utility seg. The need to fix this was obviated by the integration of seg into the toolkit. This showed up under LINUX.
2.) There was a memory problem with formatdb that has been fixed. This showed up mostly under NT and LINUX.
3.) A problem with running in multi-processing mode under IRIX6.5 (as a non-root user) was fixed.

Notes for 2.0.5 release:

Enhancements:

1.) The BLAST version is printed by formatdb in its log file.
2.) Multi-database searches no longer require that the -o option be used when preparing the databases (i.e., with formatdb).

Bugs fixed:

1.) A serious bug with multi-database iterative searches was fixed (thanks to Steve Brenner for providing an example).
2.) 'lcl' is not formatted in the BLAST report when the sequence identifier is a local identifier or does not contain a bar ("|").
3.) A large memory leak in formatdb was fixed.
4.)
An unnecessary cast that caused formatdb to fail on Solaris 2.5 machines if the binary was made under 2.6 was fixed.
5.) Better error checking was added to protect against core-dumps.
6.) Some problems with the sum statistics treatment of the blastx and tblastn programs reported by D. Rozenbaum were fixed. The number of alignments involved in a sum group was misrepresented. Also, an incorrect length for the database sequence was used, sometimes causing a slight change in the value reported.
7.) A problem with blastpgp was fixed that reported incorrect values for matrices other than BLOSUM62 during iterative searches.

Notes for 2.0.4 release:

Enhancements:

1.) multiple database searches: Version 2.0.4 will accept multiple database names (bracketed by quotations). An example would be

-d "nr est"

which will search both the nr and est databases, presenting the results as if one 'virtual' database consisting of all the entries from both were searched. The statistics are based on the 'virtual' database.
2.) new options:

-W Word size, default if zero [Integer] default = 0
-z Effective length of the database (use zero for the real size) [Integer] default = 0

3.) The number of identities, positives, and gaps are now printed out before the alignments for gapped blastx, tblastn, and tblastx. Additionally this feature is now also enabled for ungapped BLAST.
4.) Formatdb now accepts ASN.1, as well as FASTA, as input.

Bugs fixed:

1.) In blastx, tblastn, and tblastx a codon was incorrectly formatted as a start codon in some cases.
2.) The last alignment of the last sequence being presented was incorrectly dropped in some cases. This change could affect the statistical significance of the last database sequence if the dropped alignment had a lower e-value than any other alignments from the same database sequence.
About
Wrapper script that calls formatdb on nr database and then runs blastall against formatted db and parses the output.