Wrapper script that calls formatdb on nr database and then runs blastall against formatted db and parses the output.

lvn3668/paralogIdentification

	
			README for stand-alone BLAST
			   (last updated 08/26/2002)



This document provides information on stand-alone BLAST.  Topics covered are
setting up stand-alone BLAST, command-line options for stand-alone BLAST,
and a release history of the different versions.

BLAST binaries are provided for IRIX6.2, Solaris2.6 (Sparc), Solaris2.7 (Intel),
DEC OSF1 (ver. 4.0D), LINUX/Intel, HPUX, Macintosh, and Win32 systems.
We will attempt to produce binaries for other platforms upon request.

Stand-alone binaries are available from ftp://ftp.ncbi.nih.gov/blast/executables/

Please remember to FTP in binary mode.


Setting up Standalone BLAST for UNIX:
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

There are three steps needed to set up the Standalone BLAST
executable for the UNIX platform.

1) Download the UNIX binary, uncompress and untar the file. It is
suggested that you do this in a separate directory, perhaps called
"blast".

2) Create a .ncbirc file. In order for Standalone BLAST to operate, you
will need to have a .ncbirc file that contains the following lines:

[NCBI] 
Data="path/data/"

Where "path/data/" is the path to the location of the Standalone BLAST
"data" subdirectory. For Example: 

Data=/root/blast/data

The data subdirectory should automatically appear in the directory where
the downloaded file was extracted. Please note that in many cases it may
be necessary to specify the entire path, including the machine name
and/or the network you are located on. Your systems administrator can help
you if you do not know the entire path to the data subdirectory.

Make sure that your .ncbirc file is either in the directory that you
call the Standalone BLAST program from or in your root directory.
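
As a minimal sketch of step 2, the file can be written from the shell. The
helper name write_ncbirc and its default target (the current directory) are
assumptions made for this illustration and are not part of the BLAST
distribution.

```shell
# Hypothetical helper: write a minimal .ncbirc that points at the BLAST
# "data" subdirectory.
write_ncbirc() {
    data_dir=$1            # path to the BLAST "data" subdirectory
    target_dir=${2:-.}     # where to put .ncbirc (assumed default: current directory)
    printf '[NCBI]\nData=%s\n' "$data_dir" > "$target_dir/.ncbirc"
}
```

For example, "write_ncbirc /root/blast/data" creates .ncbirc in the directory
you are about to call BLAST from.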

3) Format your BLAST database files. The main advantage of Standalone
BLAST is to be able to create your own BLAST databases. This can be done
with any file of FASTA formatted protein or nucleotide sequences. If you
are interested in creating your own database files you should refer to
the sections "Non-redundant defline syntax" and "Appendix 1: Sequence
Identifier Syntax" of the README in the BLAST database directory
(ftp://ftp.ncbi.nih.gov/blast/db/). You can also refer to the FASTA
description available from the BLAST search pages 
(http://www.ncbi.nlm.nih.gov/BLAST/fasta.html). 

However, for testing purposes you should download one of the NCBI
databases and run a search against it.

In the BLAST database FTP directory (ftp://ftp.ncbi.nih.gov/blast/db/)
you will find the downloadable BLAST database files.  For your first
search we recommend downloading something relatively small like
ecoli.nt.Z (1349 Kb).  This is a FASTA formatted file of nucleotide
sequences which is also compressed.  Once uncompressed, you will need to
format the database using the 'formatdb' program which comes with your
Standalone BLAST executable. The list of arguments for this program and
all other BLAST programs is located at the end of the README in the
Standalone BLAST FTP directory (ftp://ftp.ncbi.nih.gov/blast/executable/). Or 
you can get these arguments by running each of the BLAST programs (formatdb, 
blastall etc.) with a single hyphen as the argument (Example: formatdb -). For
this document we are just going to show you the basic commands for formatting 
the database and running your first search.

To format the ecoli.nt database run the following from the command
line:

formatdb -i ecoli.nt -p F -o T

This will create seven index files that Standalone BLAST needs to
perform the searches and produce results. The ecoli.nt file is not
needed once formatdb has been run, so you can delete it.

Next create a test nucleotide file to run against the new database.  It
may be easier to 'cheat' here and just extract a portion of a
nucleotide sequence you know is in the downloaded ecoli.nt database.
Make a text file called test.txt with the following sequence:

>Test
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG
CCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAA
GTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC
AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTG
AAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTT
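
If you would rather extract the test sequence than paste it, the following
sketch copies the first few sequence lines of a FASTA file under a new ">Test"
header. The function name make_test_query and the choice of 8 lines are
assumptions for illustration. It also assumes that the first line of the
input file is a FASTA definition line, and it must be run before the
ecoli.nt file is deleted.

```shell
# Hypothetical helper: build a small query file from the first record
# of a FASTA file.
make_test_query() {
    fasta=$1    # uncompressed FASTA file, e.g. ecoli.nt
    out=$2      # query file to create, e.g. test.txt
    {
        echo ">Test"
        # Skip the original definition line (line 1), then keep up to 8
        # sequence lines, stopping early if a second record begins.
        awk 'NR > 1 { if (/^>/ || n == 8) exit; print; n++ }' "$fasta"
    } > "$out"
}
```

For example: make_test_query ecoli.nt test.txt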

To run the first search enter the following command from the UNIX
command line in your BLAST directory:

blastall -p blastn -d ecoli.nt -i test.txt -o test.out

This should generate a results file called test.out in the Standalone
BLAST directory. 

Now you are ready to create your own databases and run BLAST searches.
For more information you should refer to the Standalone BLAST README (
ftp://ftp.ncbi.nih.gov/blast/executable/) and the BLAST literature. 
This will give you some idea of all the programs BLAST supports and the
use of different parameters for increasing or decreasing the stringency
of your results.

If you have any questions please send them to the
[email protected] e-mail address.


Setting up Standalone BLAST for Windows
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

There are three steps needed to set up the Standalone BLAST
executable.

1) Download and decompress the Standalone BLAST Windows binary
blastcz.exe. We suggest doing this in its own directory, perhaps called
blast. This is a 'self-extracting' archive and all you need to do is run
this either through a Command Prompt (DOS Prompt) or by selecting "Run"
from the Windows "Start button" and browsing to the blastcz.exe file.

2) Create an ncbi.ini file. In order for Standalone BLAST to operate,
you will need to have an ncbi.ini file that contains the following
lines:

[NCBI] 
Data="C:\path\data\"

Where "C:\path\data\" is the path to the location of the Standalone
BLAST "data" subdirectory. For example: 

Data=C:\blast\data

This data subdirectory should automatically appear in the directory
where the downloaded file was extracted.

Make sure that your ncbi.ini file is in the Windows or WINNT directory
on your machine. Note: If you already have an ncbi.ini file on your
machine from installing other NCBI software (Network Entrez, Sequin etc.)
you can skip this section. However, if you see the following error
message, you should rename the old ncbi.ini file to something like
ncbi.bak and follow the instructions in step 2 above.

Abrupt: code=1
FATAL ERROR: FindPath failed. 

3) The main advantage of Standalone BLAST is to be able to create your
own BLAST databases. This can be done with any file of FASTA formatted
protein or nucleotide sequences. If you are interested in creating your
own database you should refer to the sections "Non-redundant defline
syntax" and "Appendix 1: Sequence Identifier Syntax" of the README in
the BLAST database directory (ftp://ftp.ncbi.nih.gov/blast/db/). You can
also refer to the FASTA description available from the BLAST search
pages (http://www.ncbi.nlm.nih.gov/BLAST/fasta.html). 

However, for testing purposes you should download one of the NCBI
databases and run a search against it.

In the BLAST database FTP directory ftp://ftp.ncbi.nih.gov/blast/db/
you will find the downloadable BLAST database files. For your first
search we recommend downloading something relatively small like
ecoli.nt.Z (1349 Kb).  This is a FASTA formatted file of nucleotide
sequences which is also compressed. (If you do not have a copy of UNIX
"uncompress" for your Windows PC contact NCBI Info at
[email protected]).

Once uncompressed, you will now need to format the database using the
'formatdb' program which comes with your Standalone BLAST executable.
The list of arguments for this program and all other BLAST programs is
located at the end of the README in the Standalone BLAST FTP directory
(ftp://ftp.ncbi.nih.gov/blast/executable/). Or you can get these
arguments by running each of the BLAST programs (formatdb, blastall
etc.) with a single hyphen as the argument (Example: formatdb -). For
this document we are just going to show you the basic commands for
formatting the database and running your first search.

To format the ecoli.nt database run the following from the command
line:

formatdb -i ecoli.nt -p F -o T

This will create seven index files that Standalone BLAST needs to
perform the searches and produce results. The ecoli.nt file can be
removed once formatdb has been run.

Next create a test nucleotide file to run against the new database.  It
may be easier to 'cheat' here and just extract a portion of a
nucleotide sequence you know is in the downloaded ecoli.nt database.
Make a text file called test.txt with the following sequence:

>Test
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG
CCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAA
GTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC
AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTG
AAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTT

To run the first search, enter the following command:

blastall -p blastn -d ecoli.nt -i test.txt -o test.out

This should generate a results file called test.out in the Standalone
BLAST directory. Now you are ready to create your own databases and run
BLAST searches. For more information you should refer to the Standalone
BLAST README ( ftp://ftp.ncbi.nih.gov/blast/executable/) and the BLAST
literature.  This will give you some idea of all the programs BLAST
supports and the use of different parameters for increasing or
decreasing the stringency of your results.

If you have any questions please send them to the
[email protected] e-mail address.


SGI Note:
---------

SGI recommends the following threads patches on IRIX6 systems:

   For 6.2 systems, install SG0001404, SG0001645, SG0002000, SG0002420 and SG0002458 (in that order)
   For 6.3 systems, install SG0001645, SG0002420 and SG0002458 (in that order)
   For 6.4 systems, install SG0002194, SG0002420 and SG0002458 (in that order)

These patches can be obtained by calling SGI customer service or from the web: http://support.sgi.com/

System recommendations:
----------------------

BLAST uses memory-mapped files (on UNIX and NT systems), so it runs best if
it can read the entire BLAST database into memory, then keep on using it
there. Resources consumed reading a database into memory can easily
outweigh the cost of a BLAST search, so that the memory of a machine is
normally more important than the CPU speed. This means that one should have
sufficient memory for the largest BLAST database one will use, then run all
the searches against this database in serial, then run queries against
another database in serial. This guarantees that the database will be read
into memory only once. As of Aug. 1997 the EST FASTA file is about 500 Meg,
which translates to about 170-200 Meg of BLAST database. At least another
100-200 Meg should be allowed for memory consumed by the actual BLAST
program. All of the FASTA databases together are about 1.5 Gig, the BLAST
databases produced from this will probably be about another Gig or so. 4 Gig
of disk space, to make room for software and output, is probably a pretty
good bet.
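
The "one database at a time" ordering recommended above can be sketched as a
small shell function that prints the blastall commands in the order they
should run. The function name plan_searches and the output file naming scheme
(QUERY.DB.out) are assumptions for this illustration. Only the blastall flags
come from this document.

```shell
# Hypothetical helper: print one blastall command per (database, query)
# pair, grouped so that every query is run against one database before
# the next database is touched.
plan_searches() {
    dbs=$1      # space-separated list of formatted database names
    shift       # remaining arguments are query files
    for db in $dbs; do
        for query in "$@"; do
            echo "blastall -p blastn -d $db -i $query -o $query.$db.out"
        done
    done
}
```

Piping the output to a shell, for example
"plan_searches "ecoli.nt est" q1.txt q2.txt | sh", would run the searches
serially in that order.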

OSF1 and limit
--------------

Some OSF1 users have encountered "out of memory" problems when running searches
even though there seems to be plenty of memory on the machine and the search
runs well on other platforms.  The error message would look something like:

[blastall] FATAL ERROR: CoreLib [001.000]  gi|509180|emb|X71670.1|MMP17SAR: Failed to allocate 480 bytes

Often it is sufficient to simply raise the "datasize" limit, which specifies
the maximum allowed heap size.  The "datasize" limit can be changed by executing:

limit datasize unlimited

Note that this change only applies to the current session, so it is advisable to place
this command in some file sourced at startup, such as .login or .cshrc.


BLAST OPTIONS
-------------

Formatdb
--------

There is now a separate document describing formatdb (README.formatdb).  Please
refer to it for information on formatting FASTA files for BLAST searches.


Blastall
--------

Blastall may be used to perform all five flavors of blast comparison. One
may obtain the blastall options by executing 'blastall -' (note the dash). A
typical use of blastall, here a blastn search (nucl. vs. nucl.) of a file
called QUERY, would be:

blastall -p blastn -d nr -i QUERY -o out.QUERY

The output is placed into the output file out.QUERY and the search is performed
against the 'nr' database.  If a protein vs. protein search is desired,
then 'blastn' should be replaced with 'blastp' etc.

Some of the most commonly used blastall options are:

blastall   arguments:

  -p  Program Name [String]

        Input should be one of "blastp", "blastn", "blastx", "tblastn", or "tblastx".

  -d  Database [String]
    default = nr

        The database specified must first be formatted with formatdb.
        Multiple database names (bracketed by quotations) will be accepted.
        An example would be

                -d "nr est"

        which will search both the nr and est databases, presenting the results as if one
        'virtual' database consisting of all the entries from both were searched.   The
        statistics are based on the 'virtual' database of nr and est.  

  -i  Query File [File In]
    default = stdin

        The query should be in FASTA format.  If multiple FASTA entries are in the input
        file, all queries will be searched.

  -e  Expectation value (E) [Real]
    default = 10.0

  -o  BLAST report Output File [File Out]  Optional
    default = stdout

  -F  Filter query sequence (DUST with blastn, SEG with others) [String]
    default = T

         BLAST 2.0 and 2.1 use the dust low-complexity filter for blastn and seg for the
         other programs. Both 'dust' and 'seg' are integral parts of the NCBI toolkit
         and are accessed automatically.

         If one uses "-F T" then normal filtering by seg or dust (for blastn)
         occurs (likewise "-F F" means no filtering whatsoever).  

         This option also takes a string as an argument.  One may use such a 
         string to change the specific parameters of seg or invoke other filters.
         Please see the "Filtering Strings" section (below) for details.

  -S  Query strands to search against database (for blast[nx], and tblastx).  3 is both, 1 is top, 2 is bottom [Integer]
    default = 3

  -T  Produce HTML output [T/F]
    default = F

  -l  Restrict search of database to list of GI's [String]  Optional

	This option specifies that only a subset of the database should be
	searched, determined by the list of gi's (i.e., NCBI identifiers) in a 
	file.  One can obtain a list of gi's for a given Entrez query from
	http://www.ncbi.nlm.nih.gov/Entrez/batch.html.  This file should
	be in the same directory as the database, or in the directory that
	BLAST is called from.

  -U  Use lower case filtering of FASTA sequence [T/F]  Optional
    default = F

        This option specifies that any lower-case letters in the input FASTA file
        should be masked.  


   Documentation for PSI-TBLASTN

PSI-TBLASTN is a variant of blastall that searches a protein query
sequence against a nucleotide sequence database using a position
specific matrix created by PSI-BLAST. The nucleotide sequence database
is dynamically translated in all reading frames during PSI-TBLASTN
search. Using a position specific matrix may enable finding more
distantly related sequences.

Programs: 
blastpgp 	[takes a protein query and performs a PSI-BLAST search to 
		create a position specific matrix using a protein 
		database]

blastall 	[reads position specific matrix and performs PSI-TBLASTN 
		search]

Usage:
A user would typically run blastpgp to create and save a position
specific matrix, followed by a run of blastall for PSI-TBLASTN search.

blastpgp must be executed with -C option followed by a file name to
save position specific score matrix.

blastall with the "-p psitblastn" option executes a PSI-TBLASTN search; the
-R option must be followed by the name of the file that contains the
position specific score matrix. All other options that apply when
using "blastall -p tblastn ..." also apply when using "blastall -p
psitblastn ...", but there are some restrictions to parameters: 1) The
query must be the same as the one used in blastpgp for creating a
position specific matrix. 2) By default, blastpgp has filtering off
(-F F) and blastall has filtering on (-F T). To ensure consistent
usage of the blastpgp/psitblastn combination, the -F option should be
explicitly set in one or the other run.


Example: 
One may run PSI-BLAST to create and save a position specific score matrix
as follows: 

	blastpgp -d nr -i ff.chd -j 2 -C ff.chd.ckp

Position specific score matrix is saved in ff.chd.ckp. Then, using 
this matrix, one may run PSI-TBLASTN search:

	blastall -i ff.chd -d yeast -p psitblastn -R ff.chd.ckp

Note that this allows the score matrix to be constructed using one
database (nr in the example) and then used to search a second database
(yeast in the example). Even if the two database names are the same,
blastpgp uses the protein version while "blastall -p psitblastn" uses
the DNA version.
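
To follow the advice about setting -F explicitly, the two runs can be
generated together so that both carry the same filter setting. This is only a
sketch: the function name psitblastn_plan, the checkpoint name (query file
name plus ".ckp") and the choice of "-F F" for both runs are assumptions made
for illustration. The function prints the commands rather than running them.

```shell
# Hypothetical helper: print a blastpgp/psitblastn command pair that uses
# the same query, the same checkpoint file and the same -F setting.
psitblastn_plan() {
    query=$1            # protein query file, used in both runs
    protein_db=$2       # database used to build the matrix
    nucleotide_db=$3    # database searched with the saved matrix
    ckp=$query.ckp      # assumed checkpoint file name
    echo "blastpgp -d $protein_db -i $query -j 2 -C $ckp -F F"
    echo "blastall -p psitblastn -d $nucleotide_db -i $query -R $ckp -F F"
}
```

For example, "psitblastn_plan ff.chd nr yeast" prints the two commands from
the example above with -F F added to each.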



Blastpgp
--------

Blastpgp performs gapped blastp searches and can be used to perform
iterative searches in psi-blast and phi-blast mode. See the PSI-Blast and
PHI-BLAST sections (below) for a description of this binary. The options may be
obtained by executing 'blastpgp -'.

  -T  Produce HTML output [T/F]
    default = F

  -Q  Output File for PSI-BLAST Matrix in ASCII [File Out]  Optional

Bl2seq
------

Bl2seq performs a comparison between two sequences using either the blastn or
blastp algorithm.  Both sequences must be either nucleotides or proteins.
The options may be obtained by executing 'bl2seq -'.

  -i  First sequence [File In]
  -j  Second sequence [File In]
  -p  Program name: blastp, blastn, blastx. For blastx 1st argument should be nucleotide [String]
    default = blastp
  -g  Gapped [T/F]
    default = T
  -o  alignment output file [File Out]
    default = stdout
  -d  theor. db size (zero is real size) [Integer]
    default = 0
  -a  SeqAnnot output file [File Out]  Optional
  -G  Cost to open a gap (zero invokes default behavior) [Integer]
    default = 0
  -E  Cost to extend a gap (zero invokes default behavior) [Integer]
    default = 0
  -X  X dropoff value for gapped alignment (in bits) (zero invokes default behavior) [Integer]
    default = 0
  -W  Wordsize (zero invokes default behavior) [Integer]
    default = 0
  -M  Matrix [String]
    default = BLOSUM62
  -q  Penalty for a nucleotide mismatch (blastn only) [Integer]
    default = -3
  -r  Reward for a nucleotide match (blastn only) [Integer]
    default = 1
  -F  Filter query sequence (DUST with blastn, SEG with others) [String]
    default = T
  -e  Expectation value (E) [Real]
    default = 10.0
  -S  Query strands to search against database (blastn only).  3 is both, 1 is top, 2 is bottom [Integer]
    default = 3
  -T  Produce HTML output [T/F]
    default = F


Fastacmd
--------

Fastacmd retrieves FASTA formatted sequences from a BLAST database, if it was formatted
using the '-o' option.  An example fastacmd call would be:

fastacmd -d nr -s p38398

The fastacmd options are:

fastacmd   arguments:

  -d  Database [String]
    default = nr
  -s  Search string: GIs, accessions and locuses may be used (delimited
      by comma or space) [String]  Optional
  -i  Input file with GIs/accessions/locuses for batch retrieval [String]  Optional
  -a  Retrieve duplicated accessions [T/F]  Optional
    default = F
  -l  Line length for sequence [Integer]  Optional
    default = 80



Filtering Strings
-----------------

         The -F argument can take a string as input specifying that seg should be
         run with certain values or that other non-standard filters should be used.
         This section describes this syntax.

         The seg options can be changed by using:

         -F "S 10 1.0 1.5"

         which specifies a window of 10, locut of 1.0 and hicut of 1.5.  

         A coiled-coil filter, based on the work of Lupas et al. (Science, vol 252, pp. 1162-4 (1991)) 
         and written by John Kuzio (Wilson et al., J Gen Virol, vol. 76, pp. 2923-32 (1995)), may be invoked
         by specifying:

         -F "C"

         There are three parameters for this: window, cutoff (prob of a coiled-coil), and
         linker (distance between two coiled-coil regions that should be linked
         together).  These are now set to

         window: 22
         cutoff: 40.0
         linker: 32

         One may also change the coiled-coil parameters in a manner analogous to
         that of seg:

         -F "C 28 40.0 32" will change the window to 28.

         One may also run both seg and coiled-coil together by using a ";":

         -F "C;S"

         Filtering by dust may also be specified by:

         -F "D"

         It is possible to specify that the masking should only be done during
         the process of building the initial words by starting the filtering
         command with 'm', e.g.:

         -F "m S"

         which specifies that seg (with default arguments) should be used for masking, 
         but that the masking should only be done when the words are being built.  
         This masking option is available with all filters.

         If the -U option (to mask any lower-case sequence in the input FASTA file) is used and
         one does not wish any other filtering, but does wish to mask when building the lookup tables
         then one should specify:

         -F "m"

         This is the only case where "m" should be specified alone.


PSI-Blast
---------

The blastpgp program can do an iterative search in which
sequences found in one round of searching are used to build
a score model for the next round of searching. In this usage,
the program is called Position-Specific Iterated BLAST, or PSI-BLAST.
As explained in the accompanying paper, the BLAST algorithm is
not tied to a specific score matrix. Traditionally, it has been
implemented using an AxA substitution matrix where A is the alphabet size.
PSI-BLAST instead uses a QxA matrix, where Q is the length of the query
sequence; at each position the cost of a letter depends on the position
w.r.t. the query and the letter in the subject sequence.

The position-specific matrix for round i+1 is built from a constrained
multiple alignment among the query and the sequences found with
sufficiently low e-value in round i.  The top part of the output for
each round distinguishes the sequences into: sequences found
previously and used in the score model, and sequences not used in the
score model. The output currently includes lots of diagnostics
requested by users at NCBI. To skip quickly from the output of
one round to the next, search for the string "producing", which is
part of the header for each round and likely does not appear elsewhere
in the output. PSI-BLAST "converges" and stops if all sequences
found at round i+1 below the e-value threshold were already in
the model at the beginning of the round.

There are several blastpgp parameters specifically for PSI-BLAST:
-j   is the maximum number of rounds (default 1; i.e., regular BLAST)
-h   is the e-value threshold for including sequences in the
     score matrix model (default 0.001)
-c   is the "constant" used in the pseudocount formula specified in the
     paper (default 10)

The -C and -R flags provide a "checkpointing" facility whereby
a score model can be stored and later reused.
   -C  stores the query and frequency count ratio matrix in a
                  file
   -R  restarts from a file stored previously.
When using -R, it is required that the query specified on the command line
match exactly the query in the restart file.
The checkpoint files are stored in a byte-encoded (not human readable)
format, so as to prevent roundoff error between writing and reading
the checkpoint.
Users who also develop their own sequence analysis software may wish
to develop their own scoring systems. For this purpose the code
in posit.c that writes out the checkpoint can be easily adapted to
write out scoring systems derived by other algorithms in such
a way that PSI-BLAST can read the files in later.
The checkpoint structure is general in the sense that it can handle
any position-specific matrix that fits in the Karlin-Altschul
statistical framework for BLAST scoring.

The -B flag provides a way to jump start PSI-BLAST from a master-slave
multiple alignment computed outside PSI-BLAST.  The multiple alignment
must include the query sequence as one of the sequences, but it need
not be the first sequence.  The multiple alignment must be specified
in a format that is derived from Clustal, but without some headers and
trailers.  See example below. The rules are also described by the
following words.  Suppose the multiple alignment has N sequences.  It
may be presented in 1 or more blocks, where each block presents a
range of columns from the multiple alignment.  E.g., the first block
might have columns 1-60, the second block might have columns 61-95,
the third block might have columns 96-128. Each block should have N
rows, 1 row per sequence.  The sequences should be in the same order
in every block.  Blocks are separated by 1 or more blank lines.
Within a block there are no blank lines, and each line consists of 1
sequence identifier followed by some white space followed by
characters (and gaps) for that sequence in the multiple alignment.  In
each column, all letters must be in upper case, or all letters must be
in lower case.  Upper case means that this column is to be given
position-specific scores. Lower-case means to use the underlying
matrix (specified by -M) for this column; e.g., if the query sequence
has an 'l' residue in the column, then the standard scores for
matching an L are used in the column.
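
Some of the block rules above can be checked mechanically before an alignment
file is handed to -B. The following is a sketch only: the function name
check_alignment_blocks is an assumption for illustration and is not part of
BLAST. It checks just two of the rules (every block has the same number of
rows, and identifiers appear in the same order in every block) and does not
check the upper/lower case rule for columns.

```shell
# Hypothetical helper: succeed (exit status 0) if every block of the
# alignment file has the same identifiers in the same order.
check_alignment_blocks() {
    awk '
        function close_block() {
            if (blocks == 0) n = row          # block 1 fixes the row count N
            else if (row != n) bad = 1        # later blocks must also have N rows
            blocks++
            row = 0
        }
        # A blank line ends the current block, if one is open.
        NF == 0 { if (row > 0) close_block(); next }
        {
            row++
            if (blocks == 0) first[row] = $1  # remember identifier order
            else if ($1 != first[row]) bad = 1
        }
        END { if (row > 0) close_block(); exit (bad ? 1 : 0) }
    ' "$1"
}
```

For example: check_alignment_blocks align1 && echo "block layout looks consistent"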

A sample usage would be:

  blastpgp -i seq1 -B align1 -j 2 -d nr

where seq1 is the query
      align1 is the alignment file
      -j 2 indicates to do 2 rounds
      -d nr indicates to use the nr database

The example files
    seq1
    align1
copied below were kindly supplied by L. Aravind from a paper
he and Chris Ponting published in Protein Science:

Aravind L, Ponting CP, Homologues of 26S proteasome subunits 
are regulators of transcription and translation, Protein Science 
7(1998) 1250-1254.

L. Aravind ([email protected]) was the first user
and helped define how -B should work. Y. Wolf ([email protected])
helped design a more flexible input format for the alignments.
If you like how -B works, let them know.
If you do not like how -B works, complain to 
A. Schaffer([email protected]) who did the implementation.

seq1
----
> 26SPS9_Hs 
IHAAEEKDWKTAYSYFYEAFEGYDSIDSPKAITSLKYMLLCKIMLNTPEDVQALVSGKLALRYAGRQTEA
LKCVAQASKNRSLADFEKALTDYRAELRDDPIISTHLAKLYDNLLEQNLIRVIEPFSRVQIEHISSLIKL
SKADVERKLSQMILDKKFHGILDQGEGVLIIFDEPP


align1
------
26SPS9_Hs     IHAAEEKDWKTAYSYFYEAFEGYdsidspkaitslkymllckimlntpedvqalvsgklalryagrqtealkcvaqasknr
F57B9_Ce      LHAADEKDFKTAFSYFYEAFEGYdsvdekvsaltalkymllckvmldlpdevnsllsaklalkyngsdldamkaiaaaaqk
YDL097c_Sc    ILHCEDKDYKTAFSYFFESFESYhnltthnsyekacqvlkymllskimlnliddvknilnakytketyqsrgidamkavae
YMJ5_Ce       LYSAEERDYKTSFSYFYEAFEGFasigdkinatsalkymilckimlneteqlagllaakeivayqkspriiairsmadafr
FUS6_ARATH    KNYIRTRDYCTTTKHIIHMCMNAilvsiemgqfthvtsyvnkaeqnpetlepmvnaklrcasglahlelkkyklaarkfld
COS41.8_Ci    SLDYKLKTYLTIARLYLEDEDPVqaemyinrasllqnetadeqlqihykvcyarvldyrrkfleaaqrynelsyksaihet
644879        KCYSRARDYCTSAKHVINMCLNVikvsvylqnwshvlsyvskaestpeiaeqrgerdsqtqailtklkcaaglaelaarky
YPR108w_Sc    IHCLAVRNFKEAAKLLVDSLATFtsieltsyesiatyasvtglftlertdlkskvidspellslisttaalqsissltisl
eif-3p110_Hs  SKAMKMGDWKTCHSFIINEKMNGkvw-------------------------------------------------------
T23D8.4_Ce    SKAMLNGDWKKCQDYIVNDKMNQkvw-------------------------------------------------------
YD95_Sp       IYLMSIRNFSGAADLLLDCMSTFsstellpyydvvryavisgaisldrvdvktkivdspevlavlpqnesmssleacinsl
KIAA0107_Hs   LYCVAIRDFKQAAELFLDTVSTFtsyelmdyktfvtytvyvsmialerpdlrekvikgaeilevlhslpavrqylfslyec
F49C12.8_Hs   LYRMSVRDFAGAADLFLEAVPTFgsyelmtyenlilytvitttfaldrpdlrtkvircnevqeqltggglngtlipvreyl
Int-6_Mm      KFQYECGNYSGAAEYLYFFRVLVpatdrnalsslwgklaseilmqnwdaamedltrlketidnnsvssplqslqqrtwlih

26SPS9_Hs     sladfekaltdy-----------------------------------------------------------------------------------
F57B9_Ce      rslkdfqvafgsf----------------------------------------------------------------------------------
YDL097c_Sc    aynnrslldfntalkqy------------------------------------------------------------------------------
YMJ5_Ce       krslkdfvkalaeh---------------------------------------------------------------------------------
FUS6_ARATH    vnpelgnsyneviapqdiatygglcalasfdrselkqkvidninfrnflelvpdvrelindfyssryascleylasl------------------
COS41.8_Ci    eqtkalekalncailapagqqrsrmlatlfkdercqllpsfgilekmfldriiksdemeefar--------------------------------
644879        kqaakclllasfdhcdfpellspsnvaiygglcalatfdrqelqrnvissssfklflelepqvrdiifkfyeskyasclkmldem----------
YPR108w_Sc    yasdyasyfpyllety-------------------------------------------------------------------------------
eif-3p110_Hs  -----------------------------------------------------------------------------------------------
T23D8.4_Ce    -----------------------------------------------------------------------------------------------
YD95_Sp       ylcdysgffrtladve-------------------------------------------------------------------------------
KIAA0107_Hs   rysvffqslavv-----------------------------------------------------------------------------------
F49C12.8_Hs   esyydchydrffiqlaale----------------------------------------------------------------------------
Int-6_Mm      wslfvffnhpkgrdniidlflyqpqylnaiqtmcphilrylttavitnkdvrkrrqvlkdlvkviqqesytykdpitefveclyvnfdfdgaqkk

26SPS9_Hs     ----RAELRDDPIISTHLAKLYDNLLEQNLIRVIEPFSRVQIEHISSLIKLSKADVERKLSQMILDKKFHGILDQGEGVLIIFDEPP
F57B9_Ce      ----PQELQMDPVVRKHFHSLSERMLEKDLCRIIEPYSFVQIEHVAQQIGIDRSKVEKKLSQMILDQKLSGSLDQGEGMLIVFEIAV
YDL097c_Sc    ----EKELMGDELTRSHFNALYDTLLESNLCKIIEPFECVEISHISKIIGLDTQQVEGKLSQMILDKIFYGVLDQGNGWLYVYETPN
YMJ5_Ce       ----KIELVEDKVVAVHSQNLERNMLEKEISRVIEPYSEIELSYIARVIGMTVPPVERAIARMILDKKLMGSIDQHGDTVVVYPKAD
FUS6_ARATH    ----KSNLLLDIHLHDHVDTLYDQIRKKALIQYTLPFVSVDLSRMADAFKTSVSGLEKELEALITDNQIQARIDSHNKILYARHADQ
COS41.8_Ci    ----QLMPHQKAITADGSNILHRAVTEHNLLSASKLYNNIRFTELGALLEIPHQMAEKVASQMICESRMKGHIDQIDGIVFFERRET
644879        ----KDNLLLDMYLAPHVRTLYTQIRNRALIQYFSPYVSADMHRMAAAFNTTVAALEDELTQLILEGLISARVDSHSKILYARDVDQ
YPR108w_Sc    ----ANVLIPCKYLNRHADFFVREMRRKVYAQLLESYKTLSLKSMASAFGVSVAFLDNDLGKFIPNKQLNCVIDRVNGIVETNRPDN
eif-3p110_Hs  ----DLFPEADKVRTMLVRKIQEESLRTYLFTYSSVYDSISMETLSDMFELDLPTVHSIISKMIINEELMASLDQPTQTVVMHRTEP
T23D8.4_Ce    ----NLFHNAETVKGMVVRRIQEESLRTYLLTYSTVYATVSLKKLADLFELSKKDVHSIISKMIIQEELSATLDEPTDCLIMHRVEP
YD95_Sp       ----VNHLKCDQFLVAHYRYYVREMRRRAYAQLLESYRALSIDSMAASFGVSVDYIDRDLASFIPDNKLNCVIDRVNGVVFTNRPDE
KIAA0107_Hs   ----EQEMKKDWLFAPHYRYYVREMRIHAYSQLLESYRSLTLGYMAEAFGVGVEFIDQELSRFIAAGRLHCKIDKVNEIVETNRPDS
F49C12.8_Hs   ----SERFKFDRYLSPHFNYYSRGMRHRAYEQFLTPYKTVRIDMMAKDFGVSRAFIDRELHRLIATGQLQCRIDAVNGVIEVNHRDS
Int-6_Mm      lrecESVLVNDFFLVACLEDFIENARLFIFETFCRIHQCISINMLADKLNMTPEEAERWIVNLIRNARLDAKIDSKLGHVVMGNNAV





PHI-BLAST
---------

PHI-BLAST (Pattern-Hit Initiated BLAST) is a search
program that combines matching of regular expressions
with local alignments surrounding the match.
The most important features of the program have been
incorporated into the BLAST software framework
partly for user convenience and partly so that
PHI-BLAST may be combined seamlessly with PSI-BLAST.
Other features that do not fit into the BLAST framework
will be released later as a separate program and/or
separate Web page query options.

One very restrictive way to identify protein motifs
is by regular expressions that must contain each instance
of the motif. The PROSITE database is a compilation of
restricted regular expressions that describe protein motifs.
Given a protein sequence S and a regular expression pattern P
occurring in S, PHI-BLAST helps answer the question:
What other protein sequences both contain an occurrence of P
and are homologous to S in the vicinity of the pattern occurrences?
PHI-BLAST may be preferable to just searching for pattern occurrences
because it filters out those cases where the pattern occurrence is
probably random and not indicative of homology.
PHI-BLAST may be preferable to other flavors of BLAST because
it is faster and because it allows the user to express
a rigid pattern occurrence requirement.

The pattern search methods in PHI-BLAST are based on the
algorithms in:

R. Baeza-Yates and G. Gonnet, Communications of the ACM 35(1992), pp. 74-82.
S. Wu and U. Manber, Communications of the ACM 35(1992), pp. 83-91.

The calculation of local alignments is done using a method
very similar to (and much of the same code as) gapped BLAST.
However, the method of evaluating statistical significance is different, and
is described below.

In the stand-alone mode the typical PHI-BLAST usage looks like:
  blastpgp -i <query> -k <pattern> -p patseedp

  where -i is followed by the file containing the query in FASTA format,
  -k is followed by the file containing the pattern in the syntax given below,
  and "patseedp" names the mode of usage; it does not refer to any file.

The syntax for the query sequence is FASTA format as for all other
BLAST queries. The syntax for patterns follows the rules of
PROSITE and is documented in detail below.
The specified pattern is not required to be in the PROSITE list.
Most of the other BLAST flags can be used with PHI-BLAST.
One important exception is that PHI-BLAST requires gapped
alignments (i.e. forbids -g F in the flags) because ungapped
alignments do not make sense for almost all patterns in PROSITE.

There is a second mode of PHI-BLAST usage that is important when
the specified pattern occurs more than 1 time in the query.
In this case, the user may be interested in restricting the
search for local alignments to a subset of the pattern occurrences.
This can be done with a search that looks like:
   blastpgp -i <query> -k <pattern> -p seedp

in which case the use of the "seedp" option requires the user to
specify the location(s) of the interesting pattern occurrence(s)
in the pattern file. The syntax for how to specify pattern
occurrences is below. When there are multiple pattern occurrences in the
query it may be important to decide how many are of interest because
the E-value for matches is effectively multiplied by the number
of interesting pattern occurrences.

The PHI-BLAST Web page supports only the "patseedp" option.

PHI-BLAST is integrated with PSI-BLAST. In the command-line
mode, PSI-BLAST can be invoked by using the -j option, as usual.
When this is done as:
   blastpgp -i <query> -k <pattern> -p patseedp -j <number of PSI-BLAST rounds>

then the first round of searching uses PHI-BLAST and all subsequent
rounds use PSI-BLAST.
In the Web page setting, the user must explicitly invoke one round
at a time, and the PHI-BLAST Web page provides the option to
initiate a PSI-BLAST round with the PHI-BLAST results.
To describe a combined usage, use the term "PHI-PSI-BLAST"
(Pattern-Hit Initiated, Position-Specific Iterated BLAST).
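
The command lines above can be assembled from a script. The following
is a minimal sketch, not part of the BLAST distribution: the helper name
phi_blast_command and the file names query.fasta and pattern.txt are
hypothetical, and only the flags discussed in this section (-i, -k, -p,
-j, -g) are assumed. The helper builds the argument list without running
anything, and it rejects "-g F" because PHI-BLAST requires gapped
alignments.

```python
def phi_blast_command(query_file, pattern_file, mode="patseedp",
                      rounds=None, extra_flags=()):
    """Build the blastpgp argument list for a PHI-BLAST run (illustrative helper)."""
    if mode not in ("patseedp", "seedp"):
        raise ValueError("mode must be 'patseedp' or 'seedp'")
    extra = list(extra_flags)
    # PHI-BLAST requires gapped alignments, so "-g F" must not be passed.
    for flag, value in zip(extra, extra[1:]):
        if flag == "-g" and value == "F":
            raise ValueError("PHI-BLAST forbids ungapped alignments (-g F)")
    command = ["blastpgp", "-i", query_file, "-k", pattern_file, "-p", mode]
    if rounds is not None:
        # -j switches on PSI-BLAST iterations after the first PHI-BLAST round.
        command += ["-j", str(rounds)]
    return command + extra


# Hypothetical file names, for illustration only.
print(" ".join(phi_blast_command("query.fasta", "pattern.txt")))
print(" ".join(phi_blast_command("query.fasta", "pattern.txt", rounds=3)))
```

The resulting list could be handed to a process launcher such as
subprocess.run once the binaries and databases are installed.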

Determining statistical significance.

When a query sequence Q matches a database sequence D in PHI-BLAST,
it is useful to subdivide Q and D into 3 disjoint pieces
    Qleft Qpattern Qright
    Dleft Dpattern Dright

The substrings Qpattern and Dpattern contain the pattern specified
in the pattern file. The pieces Qpattern and Dpattern are aligned
and that alignment is displayed as part of the PHI-BLAST output,
but the score for that alignment is mostly ignored.
The "reduced" score r of an alignment is the sum of the scores obtained
by aligning  Qleft with Dleft and by aligning Qright with Dright.

The expected number of alignments with a reduced score >= x
is given by:
       C * N * (Lambda*x + 1) * e^(-Lambda*x)
where:

C and Lambda are "constants" depending on the score matrix and the
gap costs.
N is (number of occurrences of pattern in database) * (number of
      occurrences of pattern in Q)
e is the base of the natural logarithm.

It is important to understand that this method of computing
the statistical significance of a PHI-BLAST alignment is mathematically
different from the method used for BLAST and PSI-BLAST alignments.
However, both methods provide E-values, so the E-values are
displayed with a similar output syntax.
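
As a worked illustration of the expression above, the sketch below
evaluates C * N * (Lambda*x + 1) * e^(-Lambda*x) for a reduced score x.
The function names are hypothetical, and the numeric values of C and
Lambda in the example are placeholders chosen for readability; they are
not real BLAST parameters for any matrix or gap cost.

```python
import math


def reduced_score(left_score, right_score):
    """Reduced score r: the score left of the pattern plus the score right of it."""
    return left_score + right_score


def expected_alignments(x, C, Lambda, db_occurrences, query_occurrences):
    """Expected number of alignments with reduced score >= x."""
    # N is (occurrences of the pattern in the database) * (occurrences in the query).
    N = db_occurrences * query_occurrences
    return C * N * (Lambda * x + 1.0) * math.exp(-Lambda * x)


# Placeholder constants, for illustration only.
r = reduced_score(left_score=35, right_score=42)
print(expected_alignments(r, C=0.5, Lambda=0.3, db_occurrences=1000, query_occurrences=2))
```

Because N is proportional to the number of pattern occurrences in the
query, restricting the search to fewer occurrences lowers the expected
count by the same factor.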

Rules for pattern syntax for PHI-BLAST.

The syntax for patterns in PHI-BLAST follows the conventions
of PROSITE. When using the stand-alone program, it
is permissible to have multiple patterns in a file separated
by a blank line between patterns. When using the Web-page
only one pattern is allowed per query.

Valid protein characters for PHI-BLAST patterns:
    ABCDEFGHIKLMNPQRSTVWXYZU

Valid DNA characters for PHI-BLAST patterns:
    ACGT

Other useful delimiters:
    [ ]    means any one of the characters enclosed in the brackets
        e.g., [LFYT] means one occurrence of L or F or Y or T
    -      means nothing (this is a spacer character used by PROSITE)
    x with nothing following means any residue
    x(5)  means 5 positions in which any residue is allowed (and similarly for any other
          single number in parentheses after x)
    x(2,4) means 2 to 4 positions where any residue is allowed,
           and similarly for any other two numbers separated by a comma;
           the first number should be < the second number.
    >      can occur only at the end of a pattern and means nothing
           it may occur before a period
           (another spacer used by PROSITE)

    .      may be used at the end of the pattern and means nothing

When using the stand-alone program, the pattern should
be in a file, with the first line starting:
 ID
followed by 2 spaces and a text string giving the pattern a name.

There should also be a line starting
 PA
followed by 2 spaces followed by the pattern description.

All other PROSITE codes in the first two columns are allowed,
but only the HI code, described below, is relevant to PHI-BLAST.

Here is an example from PROSITE.

ID   CNMP_BINDING_2; PATTERN.
AC   PS00889;
DT   OCT-1993 (CREATED); OCT-1993 (DATA UPDATE); NOV-1995 (INFO UPDATE).
DE   Cyclic nucleotide-binding domain signature 2.
PA   [LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV].
NR   /RELEASE=32,49340;
NR   /TOTAL=57(36); /POSITIVE=57(36); /UNKNOWN=0(0); /FALSE_POS=0(0);
NR   /FALSE_NEG=1; /PARTIAL=1;
CC   /TAXO-RANGE=??EP?; /MAX-REPEAT=2;

The line starting
    ID
gives the pattern a name.
The lines starting
     AC, DT, DE, NR, NR, CC
are relevant to PROSITE users, but irrelevant to PHI-BLAST.
These lines are tolerated, but ignored by PHI-BLAST.

The line starting
     PA
describes the pattern as:
      one of LIVMF
followed by
      G
followed by
      E
followed by
      any single character
followed by
      one of GAS
followed by
      one of LIVM
followed by
      any 5 to 11 characters
followed by
      R
followed by
      one of STAQ
followed by
      A
followed by
      any single character
followed by
      one of LIVMA
followed by
      any single character
followed by
      one of STACV

In this case the pattern ends with a period.
It can end with nothing after the last specifying symbol
or any number of > signs or periods or combination thereof.
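
The element-by-element reading above maps naturally onto an ordinary
regular expression. The sketch below is an illustration only: it covers
the subset of the syntax described in this section, the function name
prosite_to_regex is hypothetical, and it is not the pattern matcher used
inside PHI-BLAST. Following the rules above, trailing periods and > signs
are treated as meaning nothing.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern (subset described above) to a Python regex."""
    # Trailing '.' and '>' characters are spacers that mean nothing.
    core = pattern.strip().rstrip(".>")
    parts = []
    for element in core.split("-"):  # '-' only separates elements
        if not element:
            continue
        m = re.fullmatch(r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError("unsupported pattern element: %r" % element)
        symbol, low, high = m.groups()
        regex = "." if symbol == "x" else symbol  # x means any residue
        if low is not None and high is not None:
            regex += "{%s,%s}" % (low, high)      # x(2,4) style range
        elif low is not None:
            regex += "{%s}" % low                 # x(5) style fixed count
        parts.append(regex)
    return "".join(parts)


pa = "[LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV]."
print(prosite_to_regex(pa))
```

The resulting string can be passed to re.search to test whether a
sequence contains an occurrence of the pattern.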

Here is another example, illustrating the use of an HI line.

ID    ER_TARGET; PATTERN.
PA  [KRHQSA]-[DENQ]-E-L>.
HI (19 22)
HI (201 204)

In this example, the HI lines specify that the pattern
occurs twice, once from positions 19 through 22 in the
sequence and once from positions 201 through 204 in the
sequence.
These specifications are relevant when stand-alone PHI-BLAST is
used with the
     seedp
option, in which the interesting occurrences of the pattern
in the sequence are specified. In this case the
HI lines specify which occurrence(s) of the pattern
should be used to find good alignments.

In general, the seedp option is more useful than the
standard patseedp option ONLY when the
pattern occurs K > 1 times in the sequence AND
the user is interested in matching to J < K of those
occurrences.
Then using the HI lines enables the user to specify which
occurrences are of interest.
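
A pattern file of the kind shown above is simple to read
programmatically. The sketch below is a hedged illustration, not the
parser used by blastpgp: the function name parse_pattern_file is
hypothetical, and it assumes that each pattern starts with an ID line,
that the two-letter code is followed by whitespace, and that HI lines
have the form "HI (start end)" as in the example.

```python
import re


def parse_pattern_file(text):
    """Collect the ID, PA and HI lines of each pattern; other codes are ignored."""
    patterns = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        code, rest = stripped[:2], stripped[2:].strip()
        if code == "ID":
            current = {"id": rest, "pattern": "", "occurrences": []}
            patterns.append(current)
        elif current is None:
            continue  # nothing to attach this line to yet
        elif code == "PA":
            current["pattern"] += rest
        elif code == "HI":
            m = re.fullmatch(r"\((\d+)\s+(\d+)\)", rest)
            if m:
                current["occurrences"].append((int(m.group(1)), int(m.group(2))))
    return patterns


example = """ID    ER_TARGET; PATTERN.
PA  [KRHQSA]-[DENQ]-E-L>.
HI (19 22)
HI (201 204)
"""
for entry in parse_pattern_file(example):
    print(entry["id"], entry["pattern"], entry["occurrences"])
```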

Additional functionality related to PHI-BLAST.

PHI-BLAST takes as input both a sequence and a pattern occurring in
that sequence and searches a sequence database for
other sequences containing the same pattern and having a good alignment.
One may be interested in asking two related, simpler questions:

1. Given a sequence and a database of patterns, which patterns occur
in the sequence and where?

2. Given a pattern and a sequence database, which sequences contain the
pattern and where?

These queries can be answered with software closely related to PHI-BLAST,
but they do not fit into the output framework of BLAST because the
answers are simple lists without alignments and with no notion of
statistical significance.

The NCBI toolbox includes another program, currently called
     seedtop
to answer the two queries above.

Query 1 can be asked with:
  seedtop -i <sequence file> -k <pattern file> -p patmatchp

Query 2 can be asked with:
  seedtop -d <sequence database> -k <pattern file> -p patternp

The -k argument is used similarly in all queries and the file
format is always the same. The standard pattern database is
PROSITE, but others (or a subset) can be used.
There are plans afoot to offer the patmatchp query (number 1) on
the PHI-BLAST web page or in its vicinity, but this would
be restricted to having PROSITE as the pattern database.
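
To make the two queries concrete, the sketch below answers both with an
ordinary regular expression. It is an illustration under stated
assumptions and not how seedtop is implemented: the function names are
hypothetical, the pattern is assumed to be already written as a Python
regex, the sequences are made up, and positions are reported as 1-based
inclusive (start, end) pairs, which is the convention assumed here for
HI lines.

```python
import re


def pattern_occurrences(regex, sequence):
    """Query 1 for one pattern: where does it occur in the sequence?"""
    hits = []
    # A lookahead lets matches that overlap each other be reported too.
    lookahead = re.compile("(?=(%s))" % regex)
    for m in lookahead.finditer(sequence):
        hits.append((m.start(1) + 1, m.end(1)))  # 1-based, inclusive
    return hits


def sequences_containing(regex, database):
    """Query 2: which sequences in a name -> sequence mapping contain the pattern?"""
    result = {}
    for name, sequence in database.items():
        hits = pattern_occurrences(regex, sequence)
        if hits:
            result[name] = hits
    return result


er_target = "[KRHQSA][DENQ]EL"  # regex form of [KRHQSA]-[DENQ]-E-L
print(pattern_occurrences(er_target, "MKDELAAAHDEL"))
print(sequences_containing(er_target, {"seqA": "MKDEL", "seqB": "MMMM"}))
```

For patterns with variable-length elements this sketch reports one match
per start position, which may differ from what seedtop prints.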

References

     Zhang, Zheng, Alejandro A. Schäffer, Webb Miller, Thomas L. Madden,
     David J. Lipman, Eugene V. Koonin, and Stephen F. Altschul (1998),
     "Protein sequence similarity searches using patterns as seeds", Nucleic
     Acids Res. 26:3986-3990.

     Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schäffer,
     Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
     "Gapped BLAST and PSI-BLAST: a new generation of protein database
     search programs", Nucleic Acids Res. 25:3389-3402.

     Karlin, Samuel and Stephen F. Altschul (1990).  Methods  for
     assessing the statistical significance of molecular sequence
     features by using general scoring schemes. Proc. Natl. Acad.
     Sci. USA 87:2264-68.

     Karlin, Samuel and Stephen F. Altschul (1993).  Applications
     and statistics for multiple high-scoring segments in molecular
     sequences. Proc. Natl. Acad. Sci. USA 90:5873-7.

     Schäffer, Alejandro A., L. Aravind, Thomas L. Madden, Sergei Shavirin,
     John L. Spouge, Yuri I. Wolf, Eugene V. Koonin, and Stephen F. Altschul (2001),
     Improving PSI-BLAST Protein Database Search Sensitivity with Composition-Based 
     Statistics and Other Refinements.  Nucleic Acids Res. 29:2994-3005.

Release History
---------------

Notes for 2.2.4 release (08/26/02):

Enhancements:

1.) Discontiguous word matching is now available for megablast.
See http://www.ncbi.nlm.nih.gov/blast/discontiguous.html for details.

2.) An out-of-frame gapping option (meaning that one or two bases can be
inserted or deleted from an alignment) is now available in blastall for
blastx and tblastn.  NOTE that the expect values have been calculated
assuming in-frame gapping (three bases inserted/deleted) and should only
be used for guidance.

3.) Fastacmd can now dump out partial sequences (using the -L option) 
and print taxonomic information for a sequence.

Bug fixes:

1.) A problem that caused blastall to core-dump when the -U option
(mask the sequence that is lower-case in input file) was used has been fixed.

2.) A problem that caused bl2seq to not work properly for protein-protein
searches with BLOSUM62 (on some platforms) has been fixed.

3.) A problem that caused seedtop to core dump if there were a lot
of hits has been fixed.

4.) Using -n with blastall (megablast mode) now returns the same results
as default megablast.

5.) XML output for megablast has been fixed.

6.) A problem with translating rpsblast that caused it to crash on OSF/1
and report incorrect values on other platforms has been fixed.

7.) A memory leak in formatdb was fixed.

8.) A problem that caused blastpgp to core-dump when running in PHI-BLAST mode
(if many hits were found) was fixed.  Memory leaks were also fixed.

9.) The double closing of a file that caused PHI-BLAST to crash occasionally
under LINUX has been fixed.


Notes for 2.2.3 release (04/24/02):

Enhancements:

1.) Version 4 of the BLAST databases is now the default for formatdb.  
This can be overridden for older binaries by use of "-A F" on the
command-line.


Bug fixes:

1.) A problem has been fixed that caused tblastn searches to miss some protein matches,
if the database sequence was longer than 15 million bases.

2.) Selenocysteine residues (U) in the query are now replaced by X's.  These
residues are not supported in the currently available matrices (e.g., BLOSUM62),
and their presence occasionally caused data corruption.

3.) A problem with combining the "-m 7" and "-n T" options in blastall has 
been fixed.

4.) XML output had a <Hit_def> field that could (incorrectly) have an empty value;
this has been fixed.

5.) A problem with reading databases with more than one volume and an oidlist has
been fixed.

6.) A problem with ungapped XML output that caused all HSP's to be numbered zero
has been resolved; they are now numbered starting from one.

7.) A bug that prevented use of some matrices for ungapped searches has
been fixed.

8.) Effective query and database lengths were calculated incorrectly for 
rpsblast, leading to a minor change in expect values in some cases.  This
has been corrected.

9.) A for loop that could overrun the end of a buffer during formatting was
fixed.  Many thanks to Haruna Cofer of SGI for pointing this out.

10.) The effective database length command-line argument (-z) has been fixed
for blastall and megablast.  The parser was reading digits only up to the
first non-digit (e.g., 1.6e8 was interpreted as "1"), leading
to wildly incorrect effective database lengths.  This has been fixed so 
that 160000000 and 1.6e8 are interpreted the same way. 
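
The -z problem in item 10 can be reproduced in a few lines. This is only
a sketch of the described behaviour, written in Python rather than taken
from the BLAST sources; both function names are hypothetical.

```python
import re


def parse_leading_digits(text):
    """Mimic the old behaviour: read digits and stop at the first non-digit."""
    m = re.match(r"\d+", text)
    return int(m.group(0)) if m else 0


def parse_effective_length(text):
    """Accept plain integers and scientific notation alike."""
    return float(text)


print(parse_leading_digits("1.6e8"))        # stops at the '.'
print(parse_effective_length("1.6e8"))
print(parse_effective_length("160000000"))
```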


Notes for 2.2.2 release:

Enhancements:

1.) Version 4 of the BLAST databases is now fully supported.  This version
has some enhancements described in README.formatdb and fixes some problems
described below.  Use the "-A" option on formatdb to produce the new database
version.  The BLAST binaries for release 2.2.2 are entirely compatible with
both the current and the new version of the BLAST databases.  Old BLAST binaries
are not necessarily compatible with the new database format.

2.) Fastacmd will dump out an entire BLAST database in FASTA format if the
new -D option is used.

3.) Fastacmd will separate definition lines from different GI's that have
been merged together in nr (as they all have the same sequence) by control-A's
if the new -c option is used.
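
A definition line whose entries are joined by control-A characters (item
3 above) can be split back into its parts as sketched below. The
identifiers and descriptions in the example are made up for
illustration, and the exact text fastacmd writes around each entry is
not assumed here.

```python
CONTROL_A = "\x01"  # the control-A character (ASCII code 1)


def split_merged_defline(defline):
    """Split a merged definition line into one entry per original sequence."""
    return defline.split(CONTROL_A)


# Made-up identifiers, for illustration only.
merged = "gi|111|example protein A" + CONTROL_A + "gi|222|example protein B"
for entry in split_merged_defline(merged):
    print(entry)
```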


Bug fixes:

1.) A problem has been fixed that caused tblastn searches to miss some protein matches,
if the database sequence was longer than 15 million bases.

2.) The old (current) version of the BLAST databases has a "rollover" problem if
the total number of bases in a single volume is greater than 4294967295.  The new
database version (#4) allows eight bytes for this.

3.) The old (current) version of the BLAST database format does not handle ambiguity
characters in a nucleotide database sequence if it is over 16 million characters long.
The new version of the BLAST database does.

4.) A performance problem that caused mutexes to be acquired too often for 
multi-threaded runs with four or more CPU's has been fixed.  Thanks to Haruna
Cofer of SGI for help in finding the cause.

5.) A problem that caused ungapped blastp/blastx/tblastn/tblastx to crash on
certain matrices (e.g., pam10) has been fixed.

6.) Some blastpgp problems with using the -B (for reading a master-slave alignment) and
reading checkpoint files (-C) have been resolved.
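
The "rollover" in bug fix 2 above can be illustrated with simple
arithmetic. The sketch assumes the total is held in a four-byte unsigned
field, which is what the limit of 4294967295 (2^32 - 1) suggests; the
function name is hypothetical and this is not code from formatdb.

```python
UINT32_MAX = 4294967295  # largest value a four-byte unsigned field can hold


def stored_total_32bit(total_letters):
    """Value that survives when the total is squeezed into four bytes."""
    return total_letters % (UINT32_MAX + 1)


print(stored_total_32bit(4294967295))   # still fits
print(stored_total_32bit(4294967296))   # wraps around
print(stored_total_32bit(5000000000))   # wraps to a much smaller number
```

An eight-byte field, as used by database version 4, holds all of these
totals without wrapping.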


Notes for 2.2.1 release:

Enhancements:

1.) BLAST and PSI-BLAST improvements as described in 
Schaffer et al., Nucleic Acids Research 2001 Jul 15;29(14):2994-3005.
These include improvements to the use of composition-based statistics
and improvements to the edge-correction effects.  Composition-based 
statistics were initially implemented in release 2.1.1, but the 
implementation is improved in release 2.2.1.

2.) Formatdb automatically produces database volumes for input
consisting of more than 4 billion letters.

3.) Formatdb can produce an alias file for a given database and GI list
as well as convert a GI list to the more efficient binary format.  See
details in README.formatdb.

4.) RPSBLAST now works properly with 'scaled' databases.  The scaling factor must
be set when executing the program 'makemat' (which takes PSI-BLAST checkpoints
as input).  Scaling-up the matrix improves the precision of the (integer) calculations.

5.) Tabular output has now been added to blastpgp and rpsblast, use the "-m 8" option.

6.) Blastpgp will now process multiple queries.
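
Enhancement 4 above states that scaling up the matrix improves the
precision of the integer calculations. The sketch below illustrates the
general idea with made-up score values: rounding each score to an
integer loses the fractional part, while multiplying by a scaling factor
before rounding keeps more of it. The function name and the numbers are
hypothetical, and this is not the makemat or rpsblast implementation.

```python
def summed_integer_score(real_scores, scale=1):
    """Round each scaled score to an integer, sum, then divide the scale back out."""
    return sum(round(score * scale) for score in real_scores) / scale


scores = [0.4, 0.4, 0.4, 0.4, 0.4]   # made-up values; the exact sum is 2.0
print(summed_integer_score(scores))             # unscaled: each 0.4 rounds to 0
print(summed_integer_score(scores, scale=100))  # scaled: each becomes 40
```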

Bug fixes:

1.) A problem with the -K option (for culling) that caused BLAST to crash has been fixed.

2.) A problem with the "gnl" identifier and multi-volume databases has been fixed.

3.) A problem that caused BLASTN to very rarely find suboptimal alignments has been fixed.

4.) A problem that could cause makemat to crash has been fixed.

5.) Some multi-threading problems pointed out by Henry Gabb of KAI were fixed.

6.) Some PC-lint errors and warnings pointed out by Russ Williams of United Devices
were fixed.


Notes for 2.1.3 release:

Enhancements:

1.) Addition of PSI-TBLASTN ability to blastall, see description in 
README.bls.

2.) Database sequences over 5 million bases in length are now broken
into chunks to keep memory usage reasonable.

3.) Blastall now allows one to enter a location if it is desired
to search a subsequence of the query.

4.) Formatdb can produce a new BLAST database format using the -A option.
The BLAST programs can read this format as well as the current format (the
program automatically identifies which version it should work with).  This 
new format stores the sequence definition lines in a structured manner
(as ASN.1), this will allow future versions of BLAST to better present
taxonomic information as well as information about other resources (e.g., 
UniGene, LocusLink) for a database sequence.  

5.) Blastall can now produce tab-delimited output; use "-m 8" to specify this.

6.) Improved Karlin-Altschul parameters are now being used.  They were 
calculated using the "island" method.

7.) A "gapped" check was added to BLASTN to ensure that if a hit is low-scoring
after an ungapped extension, but high-scoring after a gapped extension, it will 
not be missed.

8.) The formatdb error messages have been improved for the case of illegal
characters in the sequence.

9.) The number of HSP's saved in an ungapped search has been increased to 400 from 200.
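
The tab-delimited output from item 5 above is convenient to read from a
script. The sketch below assumes the twelve-column layout commonly
documented for "-m 8" (query id, subject id, percent identity, alignment
length, mismatches, gap openings, query start, query end, subject start,
subject end, e-value, bit score); that layout is not spelled out in this
README, so check it against your own output. The field names and the
sample line are made up for illustration.

```python
FIELDS = ["query_id", "subject_id", "pct_identity", "alignment_length",
          "mismatches", "gap_openings", "q_start", "q_end",
          "s_start", "s_end", "evalue", "bit_score"]
INT_FIELDS = ["alignment_length", "mismatches", "gap_openings",
              "q_start", "q_end", "s_start", "s_end"]
FLOAT_FIELDS = ["pct_identity", "evalue", "bit_score"]


def parse_tabular_line(line):
    """Turn one tab-delimited hit line into a dictionary with typed values."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError("expected %d columns, got %d" % (len(FIELDS), len(values)))
    record = dict(zip(FIELDS, values))
    for key in INT_FIELDS:
        record[key] = int(record[key])
    for key in FLOAT_FIELDS:
        record[key] = float(record[key])
    return record


# A made-up hit line, for illustration only.
sample = "\t".join(["query1", "subject1", "98.50", "200", "3", "0",
                    "1", "200", "11", "210", "1e-100", "380"])
print(parse_tabular_line(sample))
```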

Bug fixes:

1.) A problem with XML output was fixed.

2.) A problem with the seg filtering under LINUX was
fixed (many thanks to Eric Cabot at GCG for pointing this out).

3.) A problem with format of BLAST reports if the "-o" flag
was not used when the database was produced was fixed 
(thanks again to Eric Cabot).

4.) A problem with reading the BLAST database caused by a 4-byte signed integer 
that should have been unsigned was fixed (thanks to Haruna Cofer at SGI
for pointing this out).

5.) A problem with copymat under NT and IRIX was fixed.


Notes for 2.1.2 release:

Enhancements:

1.) Release of rpsblast.  Rpsblast performs a search against a database
of profiles.  See README.rps for full details.

2.) Release of blastclust.  BLASTCLUST automatically and systematically clusters protein sequences
based on pairwise matches found using the BLAST algorithm.   See README.bcl for
full details.

3.) Release of megablast.  Megablast uses the greedy algorithm of Webb Miller et al. 
for nucleotide sequence alignment search and concatenates many queries to save
time spent scanning the database.   See README.mbl for full details.

4.) XML output can now be produced.  Use the '-m 7' option for this.
The XML output is still experimental.  

5.) The default behavior of the culling (-K) option has been changed.  Previously
this option was set to 100, meaning that if more than 100 HSP's had a
hit to a region, lower-scoring ones would be dropped.  The option is now
zero, which turns off this behavior.  In a few cases this change will
result in more database sequences being reported.  The previous behavior can
be recovered by using '-K 100' on the command-line.
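
The effect of the culling option in item 5 can be pictured with a
deliberately simplified sketch. It is not the algorithm BLAST uses: here
an HSP is dropped when at least K higher-scoring HSPs already kept cover
its whole query range, and K = 0 turns culling off. The function name
and the HSP tuples are hypothetical.

```python
def cull_hsps(hsps, k):
    """hsps: list of (query_start, query_end, score) tuples; k = 0 disables culling."""
    if k == 0:
        return list(hsps)
    kept = []
    # Visit HSPs from the highest score down, so better hits are kept first.
    for hsp in sorted(hsps, key=lambda h: h[2], reverse=True):
        start, end, _score = hsp
        covering = sum(1 for s, e, _ in kept if s <= start and end <= e)
        if covering < k:
            kept.append(hsp)
    return kept


hsps = [(1, 100, 90), (10, 50, 80), (20, 40, 70), (200, 250, 10)]
print(cull_hsps(hsps, 0))   # culling off: everything is reported
print(cull_hsps(hsps, 1))   # at most one covering hit allowed per region
```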

Bug fixes:

1.) A bug that caused only the last SeqAnnot to be written (if the -O option
was used) when multiple sequences were searched has been fixed.  All
SeqAnnots are printed out.

2.) A bug that caused the search space (set on the command line with the -Y option)
to be ignored for some blastx and tblastn calculations has been fixed.

3.) A failure to close a file if a GI list was used (using the -l option) was
fixed.  Many thanks to David Mathog at CalTech for spotting this problem
and suggesting a fix.

4.) A bug that caused all the database names listed in an alias file to be
printed, rather than the "TITLE" field has been fixed.



Notes for 2.1.1:

Enhancements:

1.) Addition of composition-based statistics:

BLAST and PSI-BLAST now permit calculated E-values to take into account the amino acid composition of the individual database sequences involved in reported
alignments. This improves E-value accuracy, thereby reducing the number of false positive results. 

The improved statistics are achieved with a scaling procedure [1,2] which in effect employs a slightly different scoring system for each database sequence. As a result,
raw BLAST alignment scores in general will not correspond precisely to those implied by any standard substitution matrix. Furthermore, identical alignments can receive
different scores, based upon the compositions of the sequences they involve. The improved statistics are now used by default for all rounds of searching on the
PSI-BLAST page, but not on the BLAST page. Therefore, if one uses default settings, the results of the first round of searching will be different on the BLAST and
PSI-BLAST pages. 

In addition adjustments have been made to two PSI-BLAST parameters: the pseudocount constant default has been changed from 10 to 7, and the E-value threshold for
including matches in the PSI-BLAST model has been changed from 0.001 to 0.002. 

1. Altschul, S.F. et al. (1997) Nucl. Acids Res. 25:3389-3402.
2. Schäffer, A.A. et al. (1999) Bioinformatics 15:1000-1011. 


Notes for 2.0.14 release:


Bug fixes:

1.) extra line returns between sequences in a FASTA file 
caused formatdb to produce corrupted databases.

2.) ";" at the beginning of a line was not being treated as a comment.

3.) a problem with the formatter causes blast to core-dump if
the FASTA definition line only contains an identifier and
no description.

4.) a problem in the ungapped extension for protein sequences
causes a rare problem.

5.) the '-U' option that causes lower-case sequence to be masked
does not work correctly for blastx.


Notes for 2.0.13 release:

Enhancements:

1.) The output format for pairwise alignments was changed to
put each new gi (if the sequence has redundant gi's) on a
new line.  If HTML output is specified then each gi is hyperlinked.

Bug fixes:

1.) An NCBI toolkit problem parsing the new RefSeq format in FASTA files
(two bars instead of three) was fixed.  This fix applies to all
BLAST binaries (formatdb, blastall, blastpgp, etc.).

2.) A problem that caused BLAST version 2.0.12 under NT to freeze in
multithreaded mode has been fixed.

Notes for 2.0.12 release:

Enhancements:

1.) Bl2seq can now perform nucleotide-protein (blastx style) comparisons.
This necessitated changing the '-p' option from a Boolean to a
string.  Valid arguments are "blastn", "blastp", or "blastx".

Bug fixes:

1.) A problem in the NCBI threads library that caused BLAST to sometimes
stick was corrected.  Many thanks to Haruna Cofer and colleagues at SGI
for providing a fix.

2.) A problem that caused BLAST to core-dump (especially on long queries)
has been fixed.  Many thanks to Gary Williams for providing examples.

3.) A problem that prevented the search of multiple multivolume databases
has been fixed.  



Notes for 2.0.11 release:

Enhancements:

1.) Optimizations were contributed by Chris Joerg of COMPAQ.  These changes
reduce the number of cache misses, unroll loops, and make some instructions
unnecessary.  These improvements can speed up BLAST for long sequences
several-fold.

2.) A database is now only memory-mapped while being searched.  If multiple databases
are searched and the total exceeds the allowed memory-map limit this allows 
all databases to be searched as memory-mapped files.  If a database cannot
be memory-mapped it is read as an ordinary file, rather than causing an error.

Bug fixes:

1.) Formatdb was fixed to correct a problem with FASTA string identifiers under NT.

2.) Blastpgp was fixed to prevent a core-dump under LINUX.

3.) BLASTN was found to miss some hits near the expect value cutoff.  This has been
corrected.



Notes for 2.0.10 release:

Enhancements:

1.) Bl2seq, a utility to compare two sequences using the blastn or blastp approach,
is included in the archive.  See the full description in the README.bls for details.

2.) A 'sparse' option ('-s') has been added to formatdb.  This option limits the indices
for the string identifiers (used by formatdb) to accessions (i.e., no locus names).
This is especially useful for sequence sets like the EST's where the accession and locus
names are identical.  Formatdb runs faster and produces smaller temporary files if this
option is used.  It is strongly recommended for EST's, STS's, GSS's, and HTGS's.

3.) A volume option ('-v') has been added to formatdb.  This option breaks up large
FASTA files into 'volumes' (each with a maximum size of 2 billion letters).
As part of the creation of a volume, formatdb writes a new type of BLAST database file,
called an alias file, with the extension 'nal' or 'pal'.  This option
should be used if one wishes to formatdb large databases (e.g., over 2 billion 
base pairs).

4.) It is now possible to jump start the command line version of PSI-BLAST (blastpgp) 
from a multiple alignment that includes the query sequence using the -B option. Details 
are in README.bls.

5.) The maximum wordsize limit for BLASTN has been removed.
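
The volume splitting described in enhancement 3 can be sketched as a
greedy grouping of consecutive sequences so that no volume exceeds the
size limit. This is an assumption about the general approach, not the
formatdb code; the function name is hypothetical, and writing the alias
file is not shown.

```python
def assign_volumes(sequence_lengths, max_letters=2000000000):
    """Group consecutive sequence lengths into volumes of at most max_letters."""
    volumes = []
    used = 0
    for length in sequence_lengths:
        # Start a new volume when the current one cannot take this sequence.
        if not volumes or used + length > max_letters:
            volumes.append([])
            used = 0
        volumes[-1].append(length)
        used += length
    return volumes


# Small numbers stand in for billions of letters.
print(assign_volumes([6, 5, 4, 3], max_letters=10))
```

In this sketch a single sequence longer than the limit still gets a
volume of its own.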

Bug fixes:

1.) A problem if the database length, set by the '-z' option was greater than
2 billion, was fixed.

2.) A core-dump that resulted from the use of the coiled-coil masking
('-F C') was fixed by including a file needed for the data directory.

3.) A bug was fixed that caused some very short alignments to be assigned incorrect 
expect values. 

4.) A bug was fixed that caused formatdb to produce incorrect BLAST databases if
the input was ASN.1.

5.) A serious performance problem with BLASTN and longer words (greater than 16)
was fixed.

Notes for 2.0.9 release:

Enhancements:

1.) two new options have been added to blastall: to produce output in HTML and 
to search a subset of the database based upon a list of GI's.  Please see 
the options section for full information.  

2.) two new options have been added to blastpgp: to produce HTML output and to
produce an ASCII version of the PSI-BLAST Matrix.  Please see the options section
for more information.

3.) formatdb has a new option to allow specification of a 'base' name.  see the options
section for full details.

4.) it is possible to mask only during the phase when the lookup table is being built, 
but not during the extensions.  See the options section for full details.

Bug fixes:

1.) a problem that occurred when too many HSP's aligned to the same part
of the query from one database sequence has been fixed.

2.) a problem that caused seedtop to not perform pattern-matching for DNA
sequences has been fixed.

3.) the number of HSP's saved for ungapped BLAST and tblastx is now limited to
200 to prevent problems with memory and speed.

4.) a missing thread join that caused problems under DEC Alpha has been added.

5.) a formatting problem with the database summary at the beginning of the
BLAST output (if multiple databases totaling over 2 Gig) has been fixed.

6.) a bug in formatdb that caused a core-dump if the total number of sequences was an
exact multiple of 100000 was fixed.


Notes for 2.0.8 release:

Enhancements:

1.) Frame and strand information was added to the output.  Examples of the
new output format may be found at http://www.ncbi.nlm.nih.gov/BLAST/example.html.

2.) An option that specifies the query strand to be searched (for blastn, blastx, and tblastx)
has been added.  The option is '-S'.

Bug fixes:

1.) The problem with the 'too-wide' parameter input screen under NT was fixed.

2.) BLAST no longer core-dumps when the query is NULL.

3.) BLAST no longer core-dumps when the query contains an '@' and blastx or tblastx is selected.

Notes for 2.0.7 release:

Bug fixes:

1.) BLAST now multi-threads properly under LINUX.

2.) A problem with very redundant databases and psi-blast was fixed.

3.) A problem with the formatting of the number of identities and positives
was fixed.  This affected results on the minus strand only and did not
affect the expect value or scores.

4.) A problem that caused tblastn to core-dump very occasionally was corrected.

5.) A problem with multiple patterns in PHI-BLAST was fixed.

6.) A limit on the number of HSP's that were saved (100) was removed.

Notes for 2.0.6 release:

Enhancements:

1.) PHI-BLAST is included in this release.  Please see notes on PHI-BLAST for
details.

2.) SEG has become an integral part of the NCBI toolkit and it is no longer necessary
to install it separately.  It is also now supported under non-UNIX platforms.

3.) Access to filtering options.

If one uses "-F T" then normal filtering by seg or dust (for blastn)
occurs (likewise "-F F" means no filtering whatsoever).  The seg options
can be changed by using:

-F "S 10 1.0 1.5"

which specifies a window of 10, locut of 1.0 and hicut of 1.5.  One may
also specify coiled-coil filtering by specifying:

-F "C"

There are three parameters for this: window, cutoff (probability of a coiled coil), and
linker (distance between two coiled-coil regions that should be linked
together).  These are now set to

window: 22
cutoff: 40.0
linker: 32

One may also change the coiled-coil parameters in a manner analogous to
that of seg:

-F "C 28 40.0 32" will change the window to 28.

One may also run both seg and coiled-coil filtering together by using a ";":

-F "C;S"

4.) BLAST has been changed to reduce the number of redundant hits that a user
may see.  This is achieved by keeping track of the number of hits completely
contained in a certain region and eliminating those lower scoring hits that
are redundant with others.  This behavior may be controlled with the -K and -L
options:

  -K  Number of best hits from a region to keep [Integer]
    default = 50
  -L  Length of region used to judge hits [Integer]
    default = 20

Setting -K to zero turns off this feature.  This is the default only on blastall.
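
To make the idea of "redundant" hits concrete, here is a toy Python sketch.
It is NOT the BLAST implementation: it walks the hits from highest to lowest
score and drops a hit once `keep` higher-scoring kept hits already completely
contain its query interval.  It ignores -L entirely, and the function name
cull_hits and the (start, end, score) tuple layout are assumptions made for
illustration only.

```python
def cull_hits(hits, keep=50):
    """hits: list of (query_start, query_end, score) tuples.
    Return the surviving hits, highest score first (toy model only)."""
    ranked = sorted(hits, key=lambda h: h[2], reverse=True)
    if keep == 0:
        # Mirrors "-K 0 turns off this feature": nothing is removed.
        return ranked
    kept = []
    for start, end, score in ranked:
        # Count better hits already kept that fully cover this interval.
        containing = sum(1 for s, e, _ in kept if s <= start and end <= e)
        if containing < keep:
            kept.append((start, end, score))
    return kept


hits = [(1, 100, 200), (10, 90, 150), (20, 80, 120), (150, 250, 90)]
print(cull_hits(hits, keep=2))
# [(1, 100, 200), (10, 90, 150), (150, 250, 90)]
```

With keep=2, the hit at 20-80 is dropped because two higher-scoring hits
already cover it, while the hit at 150-250 lies in a different region and is
retained despite having the lowest score.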

Bug fixes:

1.) There was a problem with the procedure that called the external utility seg,
which showed up under LINUX.  The need to fix this was obviated by the
integration of seg into the toolkit.

2.) There was a memory problem with formatdb that has been fixed.  This showed up
mostly under NT and LINUX.

3.) A problem with running in multi-processing mode under IRIX6.5 (as a non-root user)
was fixed.

Notes for 2.0.5 release:

Enhancements:

1.) The BLAST version is printed by formatdb in its log file.

2.) Multi-database searches no longer require that the -o option be used when
preparing the databases (i.e., with formatdb).

Bugs fixed:

1.) A serious bug with multi-database iterative searches was fixed (thanks to
Steve Brenner for providing an example).

2.) 'lcl' is not formatted in the BLAST report when the sequence identifier
is a local identifier or does not contain a bar ("|").

3.) A large memory leak in formatdb was fixed.

4.) An unnecessary cast that caused formatdb to fail on Solaris 2.5 machines
if the binary was made under 2.6 was fixed.

5.) Better error checking was added to protect against core-dumps.

6.) Some problems with the sum statistics treatment of the blastx and tblastn
programs reported by D. Rozenbaum were fixed.  The number of alignments
involved in a sum group was misrepresented.  Also an incorrect length for
the database sequence was used, sometimes causing a slight change in the
value reported.

7.) A problem with blastpgp was fixed that reported incorrect values for
matrices other than BLOSUM62 during iterative searches.

Notes for 2.0.4 release:

Enhancements:

1.) multiple database searches:

Version 2.0.4 will accept multiple database names (bracketed by quotations).
An example would be

              -d "nr est"

which will search both the nr and est databases, presenting the results as if one
'virtual' database consisting of all the entries from both were searched.   The
statistics are based on the 'virtual' database.

2.) new options:

  -W  Word size, default if zero [Integer]
    default = 0
  -z  Effective length of the database (use zero for the real size) [Integer]
    default = 0
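
Items 1 and 2 above can be combined when a search is launched from a script.
The following is a minimal Python sketch under stated assumptions: the
function name blastall_command is hypothetical, mydb1 and mydb2 are
placeholder names for two databases of the same type prepared with formatdb,
and query.fasta is a placeholder query file.  The -p (program) and -i (query
file) options are blastall options not described in this section.

```python
import shlex


def blastall_command(program, query_file, databases, word_size=0, db_length=0):
    """Return the argument list for a blastall search over one or more databases."""
    if not databases:
        raise ValueError("at least one database name is required")
    # Several database names are passed as ONE space-separated -d argument.
    cmd = ["blastall", "-p", program, "-i", query_file,
           "-d", " ".join(databases)]
    # Zero is the documented default for -W and -z, so pass them only if set.
    if word_size:
        cmd += ["-W", str(word_size)]
    if db_length:
        cmd += ["-z", str(db_length)]
    return cmd


cmd = blastall_command("blastp", "query.fasta", ["mydb1", "mydb2"],
                       db_length=1000000)
print(shlex.join(cmd))
# blastall -p blastp -i query.fasta -d 'mydb1 mydb2' -z 1000000
```

The list can be handed to subprocess.run(cmd) as is; shlex.join is used here
only to show how the same command would be quoted in a shell.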

3.) The numbers of identities, positives, and gaps are now printed out before the
alignments for gapped blastx, tblastn, and tblastx.  Additionally this feature is
now also enabled for ungapped BLAST.

4.) Formatdb now accepts ASN.1, as well as FASTA, as input.

Bugs fixed:

1.) In blastx, tblastn, and tblastx a codon was incorrectly formatted as a start codon in
some cases.

2.) The last alignment of the last sequence being presented was incorrectly dropped
in some cases.  This change could affect the statistical significance of the last database
sequence if the dropped alignment had a lower e-value than any other alignments from the
same database sequence.
