Having a BLAST with bioinformatics (and avoiding BLASTphemy)
© BioMed Central Ltd 2001
Published: 27 September 2001
Searching for similarities between biological sequences is the principal means by which bioinformatics contributes to our understanding of biology. Of the various informatics tools developed to accomplish this task, the most widely used is BLAST, the basic local alignment search tool. This article discusses the principles, workings, applications and potential pitfalls of BLAST, focusing on the implementation developed at the National Center for Biotechnology Information.
BLAST programs (program; query sequence type; target sequence type; description)
BLASTP; protein; protein; Compares an amino acid query sequence against a protein sequence database
BLASTN; nucleotide; nucleotide; Compares a nucleotide query sequence against a nucleotide sequence database
BLASTX; nucleotide (translated); protein; Compares a nucleotide query sequence translated in all reading frames against a protein sequence database
TBLASTN; protein; nucleotide (translated); Compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames
TBLASTX; nucleotide (translated); nucleotide (translated); Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database
What does sequence comparison measure? Similarity versus homology
In describing sequence comparisons, several different terms are commonly (mis)used: identity, similarity and homology. Even though they are often used interchangeably, they have quite different meanings. Sequence identity refers to the occurrence of exactly the same nucleotide or amino acid in the same position in aligned sequences. Sequence similarity takes approximate matches into account, and is meaningful only when such substitutions are scored according to some measure of 'difference' or 'sameness', with conservative or highly probable substitutions assigned more favorable scores than non-conservative or unlikely ones. The term 'sequence homology' is the most important (and the most abused) of the three. When we say that sequence A has high homology to sequence B, we are making two distinct claims: not only are we saying that sequences A and B look much the same, but also that all of their ancestors looked the same, going all the way back to a common ancestor. Although the first of these claims is easily verified, the second is frequently in doubt. Although the comparison of two sequences is often summarized as a percentage sequence homology, that usage is generally incorrect, as the value really indicates identity and/or similarity and does not necessarily reflect an evolutionary relationship.
The discussion is not merely about terminology, however, but goes to the core of biology itself (see, for example, [9,10,11]). This point is beautifully articulated by David Wake in a 1994 book review: "Homology is the central concept for all of biology. Whenever we say that a mammalian hormone is the 'same' hormone as a fish hormone, that a human gene sequence is the 'same' as a sequence in a chimp or a mouse, that a HOX gene is the 'same' in a mouse, a fruit fly, a frog, and a human - even when we argue that discoveries about a worm, a fruit fly, a frog, a mouse, or a chimp have relevance to the human condition - we have made a bold and direct statement about homology. The aggressive confidence of modern biomedical science implies that we know what we are talking about. But a deeper reflection shows that this confidence is based more on hope than on certainty." Sequence comparison algorithms such as BLAST and FASTA (which employ heuristic algorithms to search a sequence database for the closest matches to a query sequence), and SSEARCH [13,14] (which does a full local alignment of each sequence pair by a dynamic programming method) do not measure sequence homology: they measure sequence similarity and identity. Inferences of homology can only be supplied by the user, a point reinforced by a recent letter to the editor of the Journal of Molecular Evolution entitled "The closest BLAST hit is often not the nearest neighbor."
Why do we want to know how similar two sequences are? Because Nature has solved the same problem many times, sometimes with significant similarity among the solutions. This means that the identification of similarity between sequences saves us countless biologist-years by enabling us to assign information known about one sequence to other similar sequences.
Scoring metrics: statistical versus biological
When evaluating a sequence alignment, one would like to know how meaningful it is. This requires a scoring matrix, or a table of values that describes the probability of a biologically meaningful amino-acid or nucleotide residue-pair occurring in an alignment. Typically, when two nucleotide sequences are being compared, all that is being scored is whether or not two bases at a given position are the same. All matches are given the same score (typically +1 or +5), as are all mismatches (typically -3 or -4). But with proteins the situation is different. Substitution matrices for amino acids are more complicated and implicitly take into account everything that might affect the frequency with which any amino acid is substituted for another, such as the chemical nature and frequency of occurrence of the amino acids. The objective is to provide a relatively heavy penalty for aligning two residues together if they have a low probability of being homologous (correctly aligned by evolutionary descent). There are two major forces that drive the amino-acid substitution rates away from uniformity: not all substitutions occur with the same frequency, and some substitutions are less functionally tolerated than others and are therefore selected against.
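The nucleotide case can be illustrated with a minimal sketch. The function name `score_ungapped` and its default reward and penalty are assumptions chosen for illustration; the function scores two sequences that have already been aligned without gaps and also reports their percentage identity.

```python
def score_ungapped(seq_a, seq_b, reward=1, penalty=-3):
    """Return (score, percent_identity) for two pre-aligned, equal-length sequences.

    reward and penalty are illustrative defaults, not values fixed by the text.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    # Every position is either an exact match or a mismatch; no gaps here.
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    mismatches = len(seq_a) - matches
    score = matches * reward + mismatches * penalty
    return score, 100.0 * matches / len(seq_a)


score, identity = score_ungapped("GATTACA", "GACTATA")
print(score, round(identity, 1))
```

Note that the two numbers answer different questions: the identity is a simple proportion, whereas the score depends entirely on the reward and penalty chosen.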
The BLOSUM matrices (Figure 2b) were constructed in a similar manner, but from sequences that were selected to avoid frequently occurring, highly related sequences. The underlying data were derived from the BLOCKS database [19,20], which is a set of ungapped alignments of sequences from families of related proteins. Using about 2,000 blocks of aligned sequence segments characterizing more than 500 groups of related proteins, the sequences in each block were sorted into closely related clusters and the frequencies of substitutions between these clusters within a family used to calculate the probability of a meaningful substitution. The number associated with a BLOSUM matrix (such as BLOSUM62 or BLOSUM80) indicates the cutoff value for the percentage sequence identity that defines the clusters. Lower cutoff values allow more diverse sequences into the groups, and the corresponding matrices are therefore appropriate for examining more distant relationships.
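As a rough illustration of how observed substitution frequencies become matrix entries, the sketch below computes a log-odds score: the logarithm of the ratio between how often a pair is observed and how often it would be expected by chance. The function name, the frequencies in the example, and the scale factor of 2 (half-bit units) are assumptions made for illustration, not values taken from the BLOCKS data.

```python
import math


def log_odds_score(pair_freq, freq_a, freq_b, scale=2.0):
    """Return round(scale * log2(observed / expected)) for one residue pair.

    pair_freq: observed frequency of the ordered pair in aligned blocks
    freq_a, freq_b: background frequencies of the two residues
    scale: 2.0 gives half-bit units (an assumption for this sketch)
    """
    expected = freq_a * freq_b  # frequency expected if residues paired at random
    return round(scale * math.log2(pair_freq / expected))


# Hypothetical frequencies: a pair seen 8 times more often than chance,
# and a pair seen 4 times less often than chance.
print(log_odds_score(0.02, 0.05, 0.05))
print(log_odds_score(0.000625, 0.05, 0.05))
```

Pairs observed more often than chance receive positive scores and pairs observed less often receive negative ones, which is the pattern seen in published substitution matrices.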
BLAST-related web pages at NCBI
The statistics of sequence similarity scores (introduction to BLAST statistics)
BLAST frequently asked questions (FAQ)
BLAST information (tutorials)
BLAST ftp site - clients and databases
BLAST source code
BLAST parameters and options
(a) Parameters mentioned in the text and Box 2
Reward for a nucleotide match (BLASTN only)
Penalty for a nucleotide mismatch (BLASTN only)
Cost to open a gap (zero invokes default behavior)
Cost to extend a gap (zero invokes default behavior)
Filter query sequence
Word size; default length is used if set to zero
Effective length of the database (use zero to get the real size)
Expectation value (E)
(b) Additional useful parameters
Name of the query file
Alignment viewing options, which include:
0 Pairwise alignment
1 Query-anchored showing identities
2 Query-anchored, no identities
7 XML output
Name of the BLAST report output file
Threshold for extending hits; default is used if set to zero
Perform gapped alignment (not available with TBLASTX)
Effective length of the search space (use zero to get the real size)
Query strands to search against the database (for BLAST[NX] and TBLASTX): 3 is both, 1 is top, 2 is bottom
Produce HTML output
Drop-off (X) for BLAST extensions, in bits (0.0 invokes default behavior)
X drop-off value for final gapped alignment (in bits)
Mutational events include not only substitutions but also insertions and deletions. The consequence with respect to sequence alignment and comparison is the need to introduce gaps into one or both sequences in order to produce a proper alignment. The penalty for the creation of a gap should be large enough that gaps are introduced only where needed, and the penalty for extending a gap should take into account the likelihood that insertions and deletions occur over several residues at a time. For example, some protein structural elements tend to evolve as a unit, but entire elements may move relative to one another. Affine gap penalties, which impose an 'opening' penalty for a gap and an 'extension' penalty that decreases the relative penalty for each additional position in an already opened gap, address both of these issues.
NCBI's BLAST page allows one to choose from several different sets of parameters for scoring gaps (existence penalties of 7, 8 and 9 with an extension penalty of 2, and existence penalties of 10, 11 and 12 with an extension penalty of 1). These values can be adjusted with the -G and -E flags in the stand-alone version (see Table 3 for further details of BLAST parameters and options).
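The effect of an affine penalty can be seen in a small sketch. The function name `affine_gap_cost` is hypothetical, and the convention used (a gap of length k costs the existence penalty plus k times the extension penalty) is an assumption stated here for illustration.

```python
def affine_gap_cost(length, existence=11, extension=1):
    """Cost of one gap of the given length under an affine scheme.

    Assumed convention: cost = existence + extension * length.
    """
    if length < 0:
        raise ValueError("gap length must be non-negative")
    if length == 0:
        return 0  # no gap, no cost
    return existence + extension * length


# One gap spanning three residues versus three separate single-residue gaps.
print(affine_gap_cost(3))
print(3 * affine_gap_cost(1))
```

Under these assumed values, a single three-residue gap is far cheaper than three scattered gaps, which reflects the expectation that insertions and deletions tend to span several residues at a time.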
The need for an automated way of finding the optimal alignment out of the numerous alternatives is clear, but the method must be consistent and biologically meaningful. "What sounds simple in principle isn't at all simple in practice. Choosing a good alignment by eye is possible, but life is too short to do it more than once or twice." To guarantee that you have the best alignment, many (but not all possible) alignments must be generated and evaluated. For two long sequences, doing this directly would take a considerable amount of time, even on the fastest computers. Examining the calculations in detail, however, one might notice that the vast majority of the time would be spent evaluating the same portions of the candidate alignments many times over. This redundant aspect of sequence comparison makes it amenable to a time-saving shortcut called dynamic programming.
Dynamic programming methods were first described in the 1950s, outside the context of bioinformatics, and first applied in this context by Needleman and Wunsch in 1970. These methods find an optimal solution to a given problem by breaking the original problem into smaller and smaller subproblems until the subproblems have a trivial solution, and then using those solutions to construct solutions for larger and larger portions of the original problem. In sequence comparison, the overall problem is determining the optimal alignment of two sequences. This is broken down into smaller and smaller alignments of parts of one sequence with parts of another sequence, down to the smallest case, which is the alignment of a single residue from one sequence with a single residue from the other sequence. The solution to this smallest subproblem is known, and is taken from the scoring matrix.
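The idea can be sketched as a global alignment score computed in the spirit of Needleman and Wunsch. This is a simplified illustration rather than the published method: the function name, the match and mismatch scores, and the linear (non-affine) gap penalty are all assumptions made to keep the example short.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score for aligning all of a with all of b (linear gap penalty)."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # a[:i] aligned against nothing but gaps
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            # Each cell reuses three already-solved subproblems.
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two residues
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]


print(global_alignment_score("ACGT", "AGT"))
```

Each cell is filled once from its three neighbors, so the work grows with the product of the two sequence lengths instead of with the number of possible alignments.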
A generalization of the recursive dynamic programming approach, the Smith-Waterman algorithm is an exhaustive, mathematically optimal method that handles sequence comparisons in a single computation and is guaranteed to find the highest-scoring alignment. The algorithm incorporates the concepts of mismatches and gaps, and identifies optimal local alignments. Local alignments, in which parts of one sequence are aligned to parts of another, are more biologically relevant than global alignments, in which entire sequences are aligned to each other, because long regions of high similarity are the exception, rather than the rule, for most biological applications.
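A minimal sketch of local alignment scoring in the style of Smith and Waterman follows. The function name and the scoring values are assumptions, and a linear gap penalty is used for brevity. The key difference from a global method is that no cell is allowed to fall below zero, so an alignment can start and end anywhere.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Highest score of any local alignment between a and b."""
    best = 0
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming table
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a fresh alignment here
                prev[j - 1] + pair,  # extend by aligning the two residues
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


# The shared segment GATTACA is found despite unrelated flanking residues.
print(local_alignment_score("CCCGATTACACCC", "TTGATTACATT"))
```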
Heuristics: sensitivity versus speed
As fast as computers are, and as efficient as the dynamic programming algorithms are, they are still far too slow to enable exhaustive searches of huge sequence repositories such as GenBank [24,25] or SWISS-PROT [26,27]. An exhaustive search of GenBank is still beyond the reach of most researchers' computer power - and with the growth of sequence databases outstripping increases in computation speed, this situation is not going to get better any time soon. This is where BLAST comes in. There are two primary methods for taking even shorter shortcuts by approximating the best local alignment: FASTA and BLAST. Neither is guaranteed to find the best local alignment, but they almost always do. As outlined above, this discussion will focus on BLAST.
BLAST and FASTA are similar in that both operate on the assumption that true matches are likely to have at least some short stretches of high-scoring similarity, but where FASTA looks for exactly matching 'words' (strings of residues), BLAST uses a scoring matrix - BLOSUM62 for amino-acid sequences, by default - to find words that may not match exactly but are high-scoring nevertheless. These high-scoring 'hits' are used as 'seeds' for the slower, more sophisticated dynamic programming algorithm. BLAST also performs some pre-processing of the query sequence - to filter out low-complexity regions (such as CA repeats) and to discard words not likely to form high-scoring pairs. Like FASTA, BLAST does not allow gaps in the primary word-matching pass, but it does in the subsequent Smith-Waterman alignment stage. For this reason, BLAST, like FASTA, has the potential to miss significant similarities present in the database. From a practical standpoint, BLAST is generally the way to go, not only because of its better accuracy, but also because of its availability and its wide acceptance as the standard.
What BLAST does and how it does it
In step 1, BLAST filters low complexity regions (CA repeats, for example) and removes them from the query sequence. Low compositional complexity or short-periodicity repeats can yield extremely large numbers of statistically significant but biologically uninteresting results. The filtering and removal of these can be controlled with the -F flag of the stand-alone version of BLAST and with check boxes in the web version. Next, BLAST generates a list of all of the short sequences, or words, that make up the query (Figure 3a). The default word lengths are 3 and 11, for amino-acid sequences and nucleotide sequences, respectively, and are adjustable using the -W flag in the stand-alone version. Then, BLAST uses a scoring matrix (BLOSUM62, by default, for amino acids) to determine all high-scoring matching words for each word in the query sequence. No gaps are allowed. The list of matches is reduced by taking only those that will score above a given threshold, called the neighborhood word-score threshold. There is a trade-off at this stage between speed and sensitivity: a higher threshold gives greater speed but increases the chance of missing relevant pairs. Approximately 50 of these matches are usually kept for each of the words generated from the original query.
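The word-generation part of this step can be sketched as follows. To keep the example self-contained, a toy scoring scheme stands in for BLOSUM62: the residue groupings, the scores (5, 1 and -2) and all function names are hypothetical and chosen only for illustration.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical groupings of chemically similar residues (not BLOSUM62).
GROUPS = ["AGST", "C", "DENQ", "FWY", "HKR", "ILMV", "P"]
GROUP_OF = {aa: idx for idx, group in enumerate(GROUPS) for aa in group}


def toy_score(a, b):
    """Toy substitution score: identical 5, same group 1, otherwise -2."""
    if a == b:
        return 5
    return 1 if GROUP_OF[a] == GROUP_OF[b] else -2


def query_words(query, word_size=3):
    """All overlapping words of the given length in the query."""
    return [query[i:i + word_size] for i in range(len(query) - word_size + 1)]


def neighborhood(word, threshold):
    """Every possible word scoring at least threshold against the query word."""
    hits = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        total = sum(toy_score(x, y) for x, y in zip(word, candidate))
        if total >= threshold:
            hits.append("".join(candidate))
    return hits


print(query_words("PQGEF"))
print(neighborhood("PQG", 11))
```

Raising the threshold to 15 in this toy scheme leaves only the exact word, which shows the trade-off described above: fewer words to look up, but fewer chances to seed an alignment with a diverged sequence.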
In the second step, BLAST searches through the target sequence database for exact matches to the word list generated (Figure 3b). Because BLAST has already pre-processed and indexed the databases for the occurrence of all words in each sequence in the database, this search is extremely fast. If a match is found, it is used to seed a possible alignment between the query and the database sequences.
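Following the description above, the lookup can be sketched with a dictionary that maps each word to the places it occurs in the database. The function names, the data layout and the two database sequences are hypothetical; this is an illustration of the idea, not a description of BLAST's internal data structures.

```python
from collections import defaultdict


def build_word_index(database, word_size=3):
    """Map every word to the (sequence id, position) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for pos in range(len(seq) - word_size + 1):
            index[seq[pos:pos + word_size]].append((seq_id, pos))
    return index


def find_seeds(word_list, index):
    """Return (query position, sequence id, database position) for each exact hit.

    word_list holds (query position, word) pairs, where each word is either a
    query word or one of its high-scoring neighbors.
    """
    seeds = []
    for query_pos, word in word_list:
        for seq_id, db_pos in index.get(word, []):
            seeds.append((query_pos, seq_id, db_pos))
    return seeds


database = {"seq1": "MKPQGEF", "seq2": "AAPEGAA"}  # hypothetical sequences
index = build_word_index(database)
print(find_seeds([(0, "PQG"), (0, "PEG")], index))
```

Because the index is built once, each word lookup takes roughly constant time regardless of how large the database is.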
In the third step, the original BLAST method tried to extend the alignment from the matching words in both directions as long as the score continued to increase (Figure 3c). The resulting alignment was called a high-scoring pair, or HSP. Gapped BLAST uses a lower threshold for generating the list of high-scoring matching words; the algorithm uses short matched regions with no insertions or deletions between them and within a certain distance of each other as the starting points for longer ungapped alignments. These joined regions are then extended using the same method as in the original BLAST.
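Ungapped extension from a seed can be sketched as below. The stopping rule used here, giving up once the running score falls more than a fixed drop-off below the best score seen so far, is one common way to formalize "as long as the score continued to increase"; the function names, the +1/-1 scoring and the drop-off value are assumptions for illustration.

```python
def simple_score(a, b):
    """Hypothetical scoring: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


def extend(query, subject, q_pos, s_pos, step, score_fn, x_drop):
    """Walk in one direction (step is +1 or -1); return (best gain, residues kept)."""
    best = running = 0
    best_len = n = 0
    while 0 <= q_pos < len(query) and 0 <= s_pos < len(subject):
        running += score_fn(query[q_pos], subject[s_pos])
        n += 1
        if running > best:
            best, best_len = running, n
        elif best - running > x_drop:
            break  # score has dropped too far below the best; stop extending
        q_pos += step
        s_pos += step
    return best, best_len


def extend_seed(query, subject, q_start, s_start, word_size,
                score_fn=simple_score, x_drop=2):
    """Extend a word hit in both directions without gaps."""
    seed = sum(score_fn(query[q_start + k], subject[s_start + k])
               for k in range(word_size))
    right_gain, right_len = extend(query, subject, q_start + word_size,
                                   s_start + word_size, 1, score_fn, x_drop)
    left_gain, left_len = extend(query, subject, q_start - 1,
                                 s_start - 1, -1, score_fn, x_drop)
    return {
        "score": seed + left_gain + right_gain,
        "query_start": q_start - left_len,
        "subject_start": s_start - left_len,
        "length": left_len + word_size + right_len,
    }


# Seed word TTA at position 4 in both sequences; the extension recovers GATTACA.
print(extend_seed("AAGATTACAGG", "CCGATTACATT", 4, 4, 3))
```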
Always compare protein sequences if the query sequences encode proteins
Given that nucleotide and protein databases are not uniformly populated, nucleotide and amino-acid sequence comparisons should be used to complement each other. Despite the fact that protein databases tend to be more sparsely populated than nucleotide databases, the constraints of protein evolution - the fact that a protein folds into a functional structure - along with the redundancy of the genetic code, make protein sequence comparison a more powerful tool for inferring structure and function from sequence.
Pay close attention to the statistics
Although most sequences that share significant similarity are homologous, many homologous sequences do not share significant similarity. In addition, repetitive sequences violate certain assumptions made in the statistical theory that underlies BLAST. Ensure that matches are not simply due to biased amino-acid composition. Certain sequences, such as low-complexity regions, can display significant similarity when there is no homology. And keep in mind that similarity spread out over a whole domain is likely to be more biologically significant than short, nearly exact matches.
Avoid reporting raw BLAST scores in publications
The significance and meaning of raw BLAST scores depend on many things, so on their own they are, at best, meaningless and may be deceptive. It is much better to show an alignment. Although normalized scores allow comparison of the results of searches using different scoring systems, they are an extreme reduction of the rich information available in an alignment. In addition, when reporting alignments, do not assume that the alignment that BLAST returns is the correct one.
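To show why a raw score cannot be interpreted on its own, the sketch below converts a raw score into a normalized (bit) score using the standard Karlin-Altschul relations, S' = (lambda * S - ln K) / ln 2 and E = m * n * 2^(-S'). The function names are hypothetical, and the lambda and K values in the example are illustrative assumptions; in practice they depend on the scoring matrix and gap penalties and are reported in the BLAST output.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalized score in bits: (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, query_len, db_len):
    """Expected number of chance alignments scoring at least this many bits."""
    return query_len * db_len * 2.0 ** (-bits)


# Illustrative parameters only; the same raw score of 100 maps to different
# bit scores under two different (assumed) scoring systems.
print(round(bit_score(100, lam=0.267, k=0.041), 1))
print(round(bit_score(100, lam=0.318, k=0.134), 1))
```

The same raw score yields different normalized scores under different parameters, which is why a raw score reported without its scoring system conveys little.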
Know the difference between sensitivity and selectivity
Similarity searching techniques can be improved either by increasing sensitivity - the ability of a method to recognize distantly related sequences - or by increasing selectivity, which means lowering the scores for unrelated sequences. Since there are many, many more unrelated sequences in a database than related ones, changes that reduce the scores of unrelated sequences can have dramatic effects.
Remember that sequence data include experimental artifacts
Sequence databases are known to include vector sequences and other sequencing errors [31,32], including contaminants, chimeric sequences, and shifts in reading frame due to insertion or deletion errors.
Finally, don't try to do too much with what BLAST gives you. Remember that the statistics behind the results tell you only how likely the given alignment is relative to finding the same alignment by chance, under particular assumptions, and do not guarantee biological significance.
The authors thank Nick Grishin and Monica Horvath for helpful discussions.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410. 10.1006/jmbi.1990.9999.
- NCBI BLAST. [http://www.ncbi.nlm.nih.gov/BLAST/]
- WU-BLAST. [http://blast.wustl.edu/]
- Baxevanis AD, Ouellette BFF (eds): Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins. John Wiley; 1998.
- Durbin R, Eddy S, Krogh A, Mitchison G: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge: Cambridge University Press; 1998.
- Higgins D, Taylor W (eds): Bioinformatics: Sequence, Structure and Databanks. New York: Oxford University Press; 2000.
- Kanehisa M: Post-Genome Informatics. New York: Oxford University Press; 2000.
- Gibas C, Jambeck P: Developing Bioinformatics Computer Skills. Sebastopol, California: O'Reilly and Associates; 2001.
- Wake DB: Comparative terminology. Science. 1994, 265: 268-269.
- Wake DB: Homoplasy, homology and the problem of 'sameness' in biology. Novartis Found Symp. 1999, 222: 24-33.
- Reeck GR, de Haen C, Teller DC, Doolittle RF, Fitch WM, Dickerson RE, Chambon P, McLachlan AD, Margoliash E, Jukes TH, et al: "Homology" in proteins and nucleic acids: a terminology muddle and a way out of it. Cell. 1987, 50: 667.
- Pearson WR, Lipman DJ: Improved tools for biological sequence comparison. Proc Natl Acad Sci USA. 1988, 85: 2444-2448.
- Altschul SF, Boguski MS, Gish W, Wootton JC: Issues in searching molecular sequence databases. Nat Genet. 1994, 6: 119-129.
- Pearson WR: Searching protein sequence libraries: comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms. Genomics. 1991, 11: 635-650.
- Koski LB, Golding GB: The closest BLAST hit is often not the nearest neighbor. J Mol Evol. 2001, 52: 540-542.
- Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA. 1992, 89: 10915-10919.
- Dayhoff MO, Schwartz RM, Orcutt BC: A model of evolutionary change in proteins. In: Atlas of Protein Sequence and Structure, vol. 5. Edited by Dayhoff MO. Washington DC: National Biomedical Research Foundation; 1978, 345-352.
- States DJ, Gish W, Altschul SF: Improved sensitivity of nucleic acid database searches using application-specific scoring matrices. Methods: A Companion to Methods in Enzymology. 1991, 3: 66-70.
- Henikoff S, Henikoff JG: Protein family classification based on searching a database of blocks. Genomics. 1994, 19: 97-107. 10.1006/geno.1994.1018.
- Henikoff S, Henikoff JG: Automated assembly of protein blocks for database searching. Nucleic Acids Res. 1991, 19: 6565-6572.
- NCBI FTP directory - BLAST matrices. [ftp://ncbi.nlm.nih.gov/blast/matrices]
- Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970, 48: 443-453.
- Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol. 1981, 147: 195-197.
- Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Rapp BA, Wheeler DL: GenBank. Nucleic Acids Res. 2000, 28: 15-18. 10.1093/nar/28.1.15.
- GenBank. [http://www.ncbi.nlm.nih.gov/Genbank/]
- Bairoch A, Apweiler R: The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 2000, 28: 45-48. 10.1093/nar/28.1.45.
- SWISS-PROT. [http://www.expasy.ch/sprot/]
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.
- Karlin S, Altschul SF: Applications and statistics for multiple high-scoring segments in molecular sequences. Proc Natl Acad Sci USA. 1993, 90: 5873-5877.
- Lamperti ED, Kittelberger JM, Smith TF, Villa-Komaroff L: Corruption of genomic databases with anomalous sequence. Nucleic Acids Res. 1992, 20: 2741-2747.
- Kristensen T, Lopez R, Prydz H: An estimate of the sequencing error frequency in the DNA sequence databases. DNA Seq. 1992, 2: 343-346.
- Lopez R, Kristensen T, Prydz H: Database contamination. Nature. 1992, 355: 211. 10.1038/355211a0.
- States DJ, Botstein D: Molecular sequence accuracy and the analysis of protein coding regions. Proc Natl Acad Sci USA. 1991, 88: 5518-5522.
- Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, Bult CJ, Tomb JF, Dougherty BA, Merrick JM, et al: Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science. 1995, 269: 496-512.
- Ichikawa T, Suzuki Y, Czaja I, Schommer C, Lessnick A, Schell J, Walden R: Identification and role of adenylyl cyclase in auxin signalling in higher plants. Nature. 1997, 390: 698-701.
- Ichikawa T, Suzuki Y, Czaja I, Schommer C, Lessnick A, Schell J, Walden R: Identification and role of adenylyl cyclase in auxin signalling in higher plants. Nature. 1998, 396: 390. 10.1038/24659.
- Karlin S, Altschul SF: Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci USA. 1990, 87: 2264-2268.
- Full list of the BLAST Advanced options. [http://www.ncbi.nlm.nih.gov/BLAST/full_options.html]