Having a BLAST with bioinformatics (and avoiding BLASTphemy)
Genome Biology volume 2, Article number: reviews2002.1 (2001)
Searching for similarities between biological sequences is the principal means by which bioinformatics contributes to our understanding of biology. Of the various informatics tools developed to accomplish this task, the most widely used is BLAST, the basic local alignment search tool. This article discusses the principles, workings, applications and potential pitfalls of BLAST, focusing on the implementation developed at the National Center for Biotechnology Information.
Similarity searching, including sequence comparison, is one of the principal techniques used by computational biologists and has found widespread use among biologists in general. The most popular tool for this purpose is BLAST (basic local alignment search tool) , which performs comparisons between pairs of sequences, searching for regions of local similarity. In the 11 years since its publication, the original paper describing BLAST  has been cited over 12,000 times, and use of BLAST has become a fundamental tool of biology. It is therefore important to know how it works and what it accomplishes, how to use it properly and how to interpret someone else's published results (see Box 1). Today there are several implementations of the BLAST algorithm, with two that share a common ancestry - NCBI BLAST and WU-BLAST - enjoying the broadest use. NCBI BLAST is available from the National Center for Biotechnology Information (NCBI) , while WU-BLAST is available from Washington University in St. Louis . This article discusses the principles, workings, applications and potential pitfalls of BLAST, focusing on the NCBI version. Further details can be found in several excellent resources [4,5,6,7,8], and additional BLAST-based programs are listed in Table 1.
What does sequence comparison measure? Similarity versus homology
In describing sequence comparisons, several different terms are commonly (mis)used: identity, similarity and homology. Even though they are often used interchangeably, they have quite different meanings. Sequence identity refers to the occurrence of exactly the same nucleotide or amino acid in the same position in aligned sequences. Sequence similarity takes approximate matches into account, and is meaningful only when such substitutions are scored according to some measure of 'difference' or 'sameness' with conservative or highly probably substitutions assigned more favorable scores than non-conservative or unlikely ones. The term 'sequence homology' is the most important (and the most abused) of the three. When we say that sequence A has high homology to sequence B, then we are making two distinct claims: not only are we saying that sequences A and B look much the same, but also that all of their ancestors also looked the same, going all the way back to a common ancestor. Although the first of these claims is easily verified, the second is frequently in doubt. Although the comparison of two sequences is often summarized as a percentage sequence homology, that usage is generally incorrect as the value really indicates identity and/or similarity, and does not necessarily reflect an evolutionary relationship.
The discussion is not merely about terminology, however, but goes to the core of biology itself (see, for example, [9,10,11]). This point is beautifully articulated by David Wake in a 1994 book review : "Homology is the central concept for all of biology. Whenever we say that a mammalian hormone is the 'same' hormone as a fish hormone, that a human gene sequence is the 'same' as a sequence in a chimp or a mouse, that a HOX gene is the 'same' in a mouse, a fruit fly, a frog, and a human - even when we argue that discoveries about a worm, a fruit fly, a frog, a mouse, or a chimp have relevance to the human condition - we have made a bold and direct statement about homology. The aggressive confidence of modern biomedical science implies that we know what we are talking about. But a deeper reflection shows that this confidence is based more on hope than on certainty." Sequence comparison algorithms such as BLAST and FASTA  (which employ heuristic algorithms to search a sequence database for the closest matches to a query sequence), and SSEARCH [13,14] (which does a full local alignment of each sequence pair by a dynamic programming method) do not measure sequence homology: they measure sequence similarity and identity. Inferences of homology can only be supplied by the user, a point reinforced by a recent letter to the editor of the Journal of Molecular Evolution entitled "The closest BLAST hit is often not the nearest neighbor." 
Why do we want to know how similar two sequences are? Because Nature has solved the same problem many times, sometimes with significant similarity among the solutions. This means that the identification of similarity between sequences saves us countless biologist-years by enabling us to assign information known about one sequence to other similar sequences.
Before the similarity of two sequences can be computed, their proper alignment must be determined - an inherently circular problem, given that evaluating an alignment requires calculating similarities (Figure 1). The question 'How similar are two sequences?' is not as simple as it seems (see, for example, ). It is, in fact, several questions: Is there a perfect match between the two sequences? If there is no perfect match, what is the best alignment between the two sequences? How should alignments be scored? And if gaps are allowed, how should they be scored? Answering these questions requires three things: a means of scoring matches and mismatches, a means of scoring gaps, and a method of using the two to evaluate numerous possible alignments.
Scoring metrics: statistical versus biological
When evaluating a sequence alignment, one would like to know how meaningful it is. This requires a scoring matrix, or a table of values that describes the probability of a biologically meaningful amino-acid or nucleotide residue-pair occurring in an alignment. Typically, when two nucleotide sequences are being compared, all that is being scored is whether or not two bases at a given position are the same. All matches are given the same score (typically +1 or +5), as are all mismatches (typically -1 or -4). But with proteins the situation is different. Substitution matrices for amino acids are more complicated and implicitly take into account everything that might affect the frequency with which any amino acid is substituted for another, such as the chemical nature and frequency of occurrence of the amino acids. The objective is to provide a relatively heavy penalty for aligning two residues together if they have a low probability of being homologous (correctly aligned by evolutionary descent). There are two major forces that drive the amino-acid substitution rates away from uniformity: not all substitutions occur with the same frequency, and some substitutions are less functionally tolerated than others and are therefore selected against.
Commonly used substitution matrices include the blocks substitution (BLOSUM)  and point accepted mutation (PAM) [17,18] matrices. Both are based on taking sets of high-confidence alignments of many homologous proteins and assessing the frequencies of all substitutions, but they are computed using different methods. The PAM matrices (Figure 2a) were calculated based on a model of evolutionary distance from alignments of closely related sequences (at least 85% identical) from 34 superfamilies grouped into 71 evolutionary trees and containing 1,572 changes, or point mutations. The stringent similarity threshold was chosen to minimize both errors in the alignments and coincident mutations. Phylogenetic trees were reconstructed for these sequences to determine the ancestral sequence for each alignment. Substitutions were tallied by type, normalized over usage frequencies and converted to log odds scores (see Figure 2 legend). The resulting matrix was called M1 or PAM1 and defines a unit of evolutionary change: the values in the M1 matrix represent the probability that one amino acid in 100 will undergo substitution. Multiplying the PAM1 matrix by itself generates scoring matrices for arbitrary degrees of relatedness; multiplying it by itself n times gives a scoring matrix for proteins that have undergone n multiple, independent mutations. The PAM120 matrix is considered a good scoring matrix for closely related sequences, while the PAM250 matrix is more appropriate for more distantly related sequences. Multiplication also multiplies the error associated with each estimate of amino-acid replacement probability, unfortunately, meaning that the PAM matrices of higher order are more prone to error.
The BLOSUM matrices (Figure 2b) were constructed in a similar manner, but from sequences that were selected to avoid frequently occurring, highly related sequences. The underlying data were derived from the BLOCKS database [19,20], which is a set of ungapped alignments of sequences from families of related proteins. Using about 2,000 blocks of aligned sequence segments characterizing more than 500 groups of related proteins, the sequences in each block were sorted into closely related clusters and the frequencies of substitutions between these clusters within a family used to calculate the probability of a meaningful substitution. The number associated with a BLOSUM matrix (such as BLOSUM62 or BLOSUM80) indicates the cutoff value for the percentage sequence identity that defines the clusters. Lower cutoff values allow more diverse sequences into the groups, and the corresponding matrices are therefore appropriate for examining more distant relationships.
When using BLAST on the NCBI website, one may choose from several different amino-acid scoring matrices: PAM30, PAM70, BLOSUM45, BLOSUM62 and BLOSUM80. A more complete set of scoring matrices, ranging from PAM10 to PAM500, and BLOSUM30 to BLOSUM100, is available from the NCBI FTP site  (see Table 2) and can be used with the stand-alone application using the -M flag (see Table 3); nucleotide match and mismatch scores can be adjusted with the -r and -q flags.
Mutational events include not only substitutions but also insertions and deletions. The consequence with respect to sequence alignment and comparison is the need to introduce gaps into one or both sequences in order to produce a proper alignment. The penalty for the creation of a gap should be large enough that gaps are introduced only where needed, and the penalty for extending a gap should take into account the likelihood that insertions and deletions occur over several residues at a time. For example, some protein structural elements tend to evolve as a unit, but entire elements may move relative to one another. Affine gap penalties, which impose an 'opening' penalty for a gap and an 'extension' penalty that decreases the relative penalty for each additional position in an already opened gap, address both of these issues.
NCBI's BLAST page  allows one to choose from several different sets of parameters for scoring gaps (existence penalties of 7, 8, and 9 with an extension penalty of 2, and existence penalties of 10,11 and 12 with an extension penalty of 1). These values can be adjusted with the -G and -E flags in the stand-alone version (See Table 3 for further details of BLAST parameters and options).
The need for an automated way of finding the optimal alignment out of the numerous alternatives is clear, but the method must be consistent and biologically meaningful. "What sounds simple in principle isn't at all simple in practice. Choosing a good alignment by eye is possible, but life is too short to do it more than once or twice."  To guarantee that you have the best alignment, many (but not all possible) alignments must be generated and evaluated. For two long sequences, doing this directly would take a considerable amount of time, even on the fastest computers. Examining the calculations in detail, however, one might notice that the vast majority of the time would be spent evaluating the same portions of the candidate alignments many times over. This redundant aspect of sequence comparison makes it amenable to a time-saving shortcut called dynamic programming.
Dynamic programming methods were first described in the 1950s, outside the context of bioinformatics, and first applied in this context by Needleman and Wunsch in 1970 . These methods find an optimal solution to a given problem by breaking the original problem into smaller and smaller subproblems until the subproblems have a trivial solution, and then using those solutions to construct solutions for larger and larger portions of the original problem. In sequence comparison, the overall problem is determining the optimal alignment of two sequences. This is broken down into smaller and smaller alignments of parts of one sequence with parts of another sequence to the smallest case, which is the alignment of a single residue from one sequence with a single residue from the other sequence. This solution to this smallest subproblem is known, and is taken from the scoring matrix.
A generalization of the recursive dynamic programming approach, the Smith-Waterman algorithm  is an exhaustive, mathematically optimal method, which handles sequence comparisons in a single computation and is guaranteed to find the highest scoring alignment. The algorithm incorporates the concepts of mismatches and gaps, and identifies optimal local alignments. Local alignments, where parts of one sequence are aligned to parts of another are more biologically relevant than global alignments where entire sequences are aligned to each other, because long regions of high similarity are the exception, rather than the rule, for most biological applications.
Heuristics: sensitivity versus speed
As fast as computers are, and as efficient as the dynamic programming algorithms are, they are still far too slow to enable exhaustive searches of huge sequence repositories such as GenBank [24,25] or SWISS-PROT [26,27]. An exhaustive search of GenBank is still beyond the reach of most researchers' computer power - and with the growth of sequence databases outstripping increases in computation speed, this situation is not going to get better any time soon. This is where BLAST comes in. There are two primary methods for taking even shorter shortcuts by approximating the best local alignment: FASTA and BLAST. Neither is guaranteed to find the best local alignment, but they almost always do. As outlined above, this discussion will focus on BLAST.
BLAST and FASTA are similar in that both operate on the assumption that true matches are likely to have at least some short stretches of high-scoring similarity, but where FASTA looks for exactly matching 'words' (strings of residues), BLAST uses a scoring matrix - BLOSUM62 for amino-acid sequences, by default - to find words that may not match exactly but are high-scoring nevertheless. These high-scoring 'hits' are used as 'seeds' for the slower, more sophisticated dynamic programming algorithm. BLAST also performs some pre-processing of the query sequence - to filter out low-complexity regions (such as CA repeats) and to discard words not likely to form high-scoring pairs. Like FASTA, BLAST does not allow gaps in the primary word-matching pass, but it does in the subsequent Smith-Waterman alignment stage. For this reason, BLAST, like FASTA, has the potential to miss significant similarities present in the database . From a practical standpoint, BLAST is generally the way to go, not only because of its better accuracy, but also because of its availability and its wide acceptance as the standard.
What BLAST does and how it does it
If we define a segment as a contiguous subsequence of a nucleotide or amino-acid sequence, and a segment pair as a pair of segments of the same length, one from each of the two sequences being compared, then the task that BLAST performs is the identification of all pairs of similar segments whose score exceeds a given threshold. The resulting pairs of similar segments are called high-scoring segment pairs (HSPs). The segment pair with the highest score is the maximal-scoring segment pair (MSP); its alignment cannot be improved by extending it or shortening it. There are three major steps in the BLAST algorithm, outlined in Figure 3. Detail for each of the steps is as follows.
In step 1, BLAST filters low complexity regions (CA repeats, for example) and removes them from the query sequence. Low compositional complexity or short-periodicity repeats can yield extremely large numbers of statistically significant but biologically uninteresting results. The filtering and removal of these can be controlled with the -F flag of the stand-alone version of BLAST and with check boxes in the web version. Next, BLAST generates a list of all of short sequences, or words, that make up the query (Figure 3a). The default word lengths are 3 and 11, for amino-acid sequences and nucleotide sequences, respectively, and are adjustable using the -W flag in the stand-alone version. Then, BLAST uses a scoring matrix (BLOSUM62, by default, for amino acids) to determine all high-scoring matching words for each word in the query sequence. No gaps are allowed. The list of matches is reduced by taking only those that will score above a given threshold, called the neighborhood word-score threshold. There is a trade-off at this stage between speed and sensitivity: a higher threshold gives greater speed but increases the chance of missing relevant pairs. Approximately 50 of these matches are usually kept for each of the words generated from the original query.
In the second step, BLAST searches through the target sequence database for exact matches to the word list generated (Figure 3b). Because BLAST has already pre-processed and indexed the databases for the occurrence of all words in each sequence in the database, this search is extremely fast. If a match is found, it is used to seed a possible alignment between the query and the database sequences.
In the third step, the original BLAST method tried to extend the alignment from the matching words in both directions as long as the score continued to increase (Figure 3c). The resulting alignment was called a high-scoring pair, or HSP. Gapped BLAST  uses a lower threshold for generating the list of high-scoring matching words; the algorithm uses short matched regions with no insertions or deletions between them and within a certain distance of each other as the starting points for longer ungapped alignments. These joined regions are then extended using the same method as in the original BLAST.
Next, BLAST determines whether each score found by one of the above methods is greater in value than a given cutoff score S, determined empirically by examining the range of scores given by comparing random sequences and then choosing a value that is significantly greater. The maximal scoring pairs, or MSPs, from the entire database are identified and listed. Finally, BLAST determines the statistical significance of each score, initially by calculating the probability that two random sequences, one the length of the query sequence and the other the length of the database (the sum of the lengths of all of the database sequences) with the same composition (nucleotide or amino acid) could produce the calculated score. Sometimes, two or more segment pairs can be made into a longer alignment; in such cases, a combined assessment of the significance is made by one of two methods : the Poisson method is based on the assumption that the probability of the multiple scores is higher when the lower score of each set is higher; the sum-of-scores method calculates the probability of the sum of the scores. Earlier versions of BLAST use the Poisson method, while later versions, including WU-BLAST and gapped BLAST, use the sum-of-scores method. When the expectation value for a given database sequence satisfies the user-selectable threshold parameter (set by the -e flag with the stand-alone version; see Table 3), the match is reported. 'Reasonable' choices vary, but are typically between 0.1 and 0.001 (see Box 2).
An example of BLAST output is shown in Figure 4. The first part of the output is the header and gives the BLAST program and version used, the reference, and the names and lengths of the query sequence and the target database. The second part is a summary of the sequences producing significant alignments along with normalized (bit) scores and E values. The third part displays the alignments and includes more detailed information about the scores, including raw score, bit score, E value and identity. The fourth part summarizes the parameters used in the search, including the name of the scoring matrix, the gap existence and extension penalties, and the properties of the search space, including the parameters λ and K. If you frame your question carefully, meaning a careful choice of parameters and databases against which to search, BLAST and other sequence comparison tools can provide a vast resource of useful information. But in using sequence similarity to infer homology, one should take care to follow a few simple rules.
Always compare protein sequences if the query sequences encode proteins
Given that nucleotide and protein databases are not uniformly populated, nucleotide and amino-acid sequence comparisons should be used to complement each other. Despite the fact that protein databases tend to be more sparsely populated than nucleotide databases, the constraints of protein evolution - the fact that a protein folds into a functional structure - along with the redundancy of the genetic code, make protein sequence comparison a more powerful tool for inferring structure and function from sequence.
Pay close attention to the statistics
Although most sequences that share significant similarity are homologous, many homologous sequences do not share significant similarity. In addition, repetitive sequences violate certain assumptions made in the statistical theory that underlies BLAST. Ensure that matches are not simply due to biased amino-acid composition. Certain sequences, such as low-complexity regions, can display significant similarity when there is no significant homology. And keep in mind that similarity spread out over a whole domain is likely to be more biologically significant than short, nearly exact matches.
Avoid reporting raw BLAST scores in publications
The significance and meaning of raw BLAST scores depends on many things, so they are, at best, meaningless and may be deceptive. It is much better to show an alignment. Although normalized scores allow comparison of the results of searches using different scoring systems, they are an extreme reduction of the rich information available in an alignment. In addition, when reporting alignments, do not assume that the alignment that BLAST returns is the correct one.
Know the difference between sensitivity and selectivity
Similarity searching techniques can be improved either by increasing sensitivity - the ability of a method to recognize distantly related sequences - or by increasing selectivity, which means lowering the scores for unrelated sequences. Since there are many, many more unrelated sequences in a database than related ones, changes that reduce the scores of unrelated sequences can have dramatic effects.
Remember that sequence data include experimental artifacts
Sequence databases are known to include vector sequences  and other sequencing errors [31,32], including contaminants, chimeric sequences, and shifts in reading frame due to insertion or deletion errors .
Finally, don't try to do too much with what BLAST gives you. Remember that the statistics behind the results only tell you the relative likelihood of finding the given alignment to finding the same alignment by chance under particular assumptions, and do not guarantee biological significance.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410. 10.1006/jmbi.1990.9999.
NCBI BLASt. [http://www.ncbi.nlm.nih.gov/BLAST/]
Baxevanis AD, Ouellette BFF, (eds): Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins. John Wiley;. 1998
Durbin R, Eddy S, Krogh A, Mitchison G: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge: Cambridge University Press;. 1998
Higgins D, Taylor W, (eds): Bioinformatics: Sequence, Structure and Databanks. New York: Oxford University Press;. 2000
Kanehisa M: Post-Genome Informatics. New York: Oxford University Press;. 2000
Gibas L, Jambeck P: Developing Bioinformatics Computer Skills. Sebastopol, California: O'Reilly and Associates;. 2001
Wake DB: Comparative terminology. Science. 1994, 265: 268-269.
Wake DB: Homoplasy, homology and the problem of 'sameness' in biology. Novartis Found Symp. 1999, 222: 24-33.
Reeck GR, de Haen C, Teller DC, Doolittle RF, Fitch WM, Dickerson RE, Chambon P, McLachlan AD, Margoliash E, Jukes TH, et al: "Homology" in proteins and nucleic acids: a terminology muddle and a way out of it. Cell. 1987, 50: 667-
Pearson WR, Lipman DJ: Improved tools for biological sequence comparison. Proc Natl Acad Sci USA. 1988, 85: 2444-2448.
Altschul SF, Boguski MS, Gish W, Wootton JC: Issues in searching molecular sequence databases. Nat Genet. 1994, 6: 119-129.
Pearson WR: Searching protein sequence libraries: comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms. Genomics. 1991, 11: 635-650.
Koski LB, Golding GB: The closest BLAST hit is often not the nearest neighbor. J Mol Evol. 2001, 52: 540-542.
Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA. 1992, 89: 10915-10919.
Dayhoff MO, Schwartz RM, Orcutt BC: A model of evolutionary change in proteins. In: Atlas of Protein Sequence and Structure, vol. 5. Edited by Dayhoff MO. Washington DC: National Biomedical Research Foundation;. 1978, 345-352.
States DJ, Gish W, Altschul SF: Improved sensitivity of nucleic acid database searches using application-specific scoring matrices. Methods: A Companion to Methods in Enzymology. 1991, 3: 66-70.
Henikoff S, Henikoff JG: Protein family classification based on searching a database of blocks. Genomics. 1994, 19: 97-107. 10.1006/geno.1994.1018.
Henikoff S, Henikoff JG: Automated assembly of protein blocks for database searching. Nucleic Acids Res. 1991, 19: 6565-6572.
NCBI FTP directory - BLAST matrices. [ftp://ncbi.nlm.nih.gov/blast/matrices]
Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970, 48: 443-453.
Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol. 1981, 147: 195-197.
Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Rapp BA, Wheeler DL: GenBank. Nucleic Acids Res. 2000, 28: 15-18. 10.1093/nar/28.1.15.
Bairoch A, Apweiler R: The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 2000, 28: 45-48. 10.1093/nar/28.1.45.
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.
Karlin S, Altschul SF: Applications and statistics for multiple high-scoring segments in molecular sequences. Proc Natl Acad Sci USA. 1993, 90: 5873-5877.
Lamperti ED, Kittelberger JM, Smith TF, Villa-Komaroff L: Corruption of genomic databases with anomalous sequence. Nucleic Acids Res. 1992, 20: 2741-2747.
Kristensen T, Lopez R, Prydz H: An estimate of the sequencing error frequency in the DNA sequence databases. DNA Seq. 1992, 2: 343-346.
Lopez R, Kristensen T, Prydz H: Database contamination. Nature. 1992, 355: 211-10.1038/355211a0.
States DJ, Botstein D: Molecular sequence accuracy and the analysis of protein coding regions. Proc Natl Acad Sci USA. 1991, 88: 5518-5522.
Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, Bult CJ, Tomb JF, Dougherty BA, Merrick JM, et al: Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science. 1995, 269: 496-512.
Ichikawa T, Suzuki Y, Czaja I, Schommer C, Lessnick A, Schell J, Walden R: Identification and role of adenylyl cyclase in auxin signalling in higher plants. Nature. 1997, 390: 698-701.
Ichikawa T, Suzuki Y, Czaja I, Schommer C, Lessnick A, Schell J, Walden R: Identification and role of adenylyl cyclase in auxin signalling in higher plants. Nature. 1998, 396: 390-10.1038/24659.
Karlin S, Altschul SF: Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci USA. 1990, 87: 2264-2268.
Full list of the BLAST Advanced options. [http://www.ncbi.nlm.nih.gov/BLAST/full_options.html]
The authors thank Nick Grishin and Monica Horvath for helpful discussions.
About this article
Cite this article
Pertsemlidis, A., Fondon, J.W. Having a BLAST with bioinformatics (and avoiding BLASTphemy). Genome Biol 2, reviews2002.1 (2001). https://doi.org/10.1186/gb-2001-2-10-reviews2002