BLAST Algorithm


A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score. Recent mathematical results on the stochastic properties of MSP scores allow an analysis of the performance of this method as well as the statistical significance of alignments it generates. The basic algorithm is simple and robust; it can be implemented in a number of ways and applied in a variety of contexts including straight-forward DNA and protein sequence database searches, motif searches, gene identification searches, and in the analysis of multiple regions of similarity in long DNA sequences. In addition to its flexibility and tractability to mathematical analysis, BLAST is an order of magnitude faster than existing sequence comparison tools of comparable sensitivity.

The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.

The BLAST family of programs allows all combinations of DNA or protein query sequences with searches against DNA or protein databases:

  • Protein-protein (blastp): compares an amino acid sequence against a protein sequence database.
  • Nucl.-nucl (blastn): compares a nucleotide query sequence against a nucleotide sequence database (in general optimized for speed, not sensitivity).
  • Translated nucl.-protein (blastx): compares the six-frame conceptual translation products of a nucleotide query against a protein sequence database.
  • Protein-translated nucl (tblastn): compares a protein query sequence against a sequence database dynamically translated in all six reading frames (useful for searching proteins against EST’s).
  • Translated nucl-translated nucl. (tblastx): compares the six frame translation of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.

BLASTuses a heuristic method .  A lookup table is made of all the “words” (short subsequences) in the query sequence. In many types of searches “neighboring” words are included. The database is scanned for matching words (“hot spots”). Gapped and un-gapped extensions are initiated from these matches.

The BLAST Search algorithm


The main idea of BLAST is that there are often High-scoring Segment Pairs (HSP) contained in a statistically significant alignment. BLAST searches for high scoring sequence alignments between the query sequence and the existing sequences in the database using a heuristic approach that approximates the Smith-Waterman algorithm. However, the exhaustive Smith-Waterman approach is too slow for searching large genomic databases such as GenBank. Therefore, the BLAST algorithm uses a heuristic approach that is less accurate than the Smith-Waterman algorithm but over 50 times faster. [8] The speed and relatively good accuracy of BLAST are among the key technical innovations of the BLAST programs.

An overview of the BLAST algorithm (a protein to protein search) is as follows:

    1. Remove low-complexity region or sequence repeats in the query sequence.
      “Low-complexity region” means a region of a sequence composed of few kinds of elements. These regions might give high scores that confuse the program to find the actual significant sequences in the database, so they should be filtered out. The regions will be marked with an X (protein sequences) or N (nucleic acid sequences) and then be ignored by the BLAST program. To filter out the low-complexity regions, the SEG program is used for protein sequences and the program DUST is used for DNA sequences. On the other hand, the program XNU is used to mask off the tandem repeats in protein sequences.
    2. Make a k-letter word list of the query sequence.
      Take k=3 for example, we list the words of length 3 in the query protein sequence (k is usually 11 for a DNA sequence) “sequentially”, until the last letter of the query sequence is included.
    3. List the possible matching words.
      This step is one of the main differences between BLAST and FASTA. FASTA cares about all of the common words in the database and query sequences that are listed in step 2; however, BLAST only cares about the high-scoring words. The scores are created by comparing the word in the list in step 2 with all the 3-letter words. By using the scoring matrix (substitution matrix) to score the comparison of each residue pair, there are 20^3 possible match scores for a 3-letter word. For example, the score obtained by comparing PQG with PEG and PQA is 15 and 12, respectively. For DNA words, a match is scored as +5 and a mismatch as -4, or as +2 and -3. After that, a neighborhood word score threshold T is used to reduce the number of possible matching words. The words whose scores are greater than the threshold T will remain in the possible matching words list, while those with lower scores will be discarded. For example, PEG is kept, but PQA is abandoned when T is 13.
    4. Organize the remaining high-scoring words into an efficient search tree.
      This allows the program to rapidly compare the high-scoring words to the database sequences
    5. Repeat step 3 to 4 for each k-letter word in the query sequence.
    6. Scan the database sequences for exact matches with the remaining high-scoring words.
      The BLAST program scans the database sequences for the remaining high-scoring word, such as PEG, of each position. If an exact match is found, this match is used to seed a possible un-gapped alignment between the query and database sequences.
    7. Extend the exact matches to high-scoring segment pair (HSP).
      • The original version of BLAST stretches a longer alignment between the query and the database sequence in the left and right directions, from the position where the exact match occurred. The extension does not stop until the accumulated total score of the HSP begins to decrease
      • To save more time, a newer version of BLAST, called BLAST2 or gapped BLAST, has been developed. BLAST2 adopts a lower neighborhood word score threshold to maintain the same level of sensitivity for detecting sequence similarity. Therefore, the possible matching words list in step 3 becomes longer. Next, the exact matched regions, within distance A from each other on the same diagonal in figure 3, will be joined as a longer new region. Finally, the new regions are then extended by the same method as in the original version of BLAST, and the HSPs’ (High-scoring segment pair) scores of the extended regions are then created by using a substitution matrix as before.
    8. List all of the HSPs in the database whose score is high enough to be considered.
      We list the HSPs whose scores are greater than the empirically determined cutoff score S. By examining the distribution of the alignment scores modeled by comparing random sequences, a cutoff score S can be determined such that its value is large enough to guarantee the significance of the remaining HSPs.
    9. Evaluate the significance of the HSP score.
      BLAST next assesses the statistical significance of each HSP score by exploiting the Gumbel extreme value distribution (EVD). (It is proved that the distribution of Smith-Waterman local alignment scores between two random sequences follows the Gumbel EVD. For local alignments containing gaps it is not proved.). In accordance with the Gumbel EVD, the probability p of observing a score S equal to or greater than x is given by the equation

      {\displaystyle p\left(S\geq x\right)=1-\exp \left(-e^{-\lambda \left(x-\mu \right)}\right)}p\left(S\geq x\right)=1-\exp \left(-e^{{-\lambda \left(x-\mu \right)}}\right)where
      {\displaystyle \mu ={}^{\left[\log \left(Km’n’\right)\right]}\!\!\diagup \!\!{}_{\lambda }\;}\mu ={}^{{\left[\log \left(Km'n'\right)\right]}}\!\!\diagup \!\!{}_{{\lambda }}\;
      The statistical parameters {\displaystyle \lambda }\lambda and {\displaystyle \mathrm {K} }\mathrm{K} are estimated by fitting the distribution of the un-gapped local alignment scores, of the query sequence and a lot of shuffled versions (Global or local shuffling) of a database sequence, to the Gumbel extreme value distribution. Note that {\displaystyle \lambda }\lambda and {\displaystyle \mathrm {K} }\mathrm{K} depend upon the substitution matrix, gap penalties, and sequence composition (the letter frequencies). {\displaystyle m’}m' and {\displaystyle n’}n' are the effective lengths of the query and database sequences, respectively. The original sequence length is shortened to the effective length to compensate for the edge effect (an alignment start near the end of one of the query or database sequence is likely not to have enough sequence to build an optimal alignment). They can be calculated as

      {\displaystyle m’\approx m-{}^{\left(\ln Kmn\right)}\!\!\diagup \!\!{}_{H}\;}m'\approx m-{}^{{\left(\ln Kmn\right)}}\!\!\diagup \!\!{}_{{H}}\;
      {\displaystyle n’\approx n-{}^{\left(\ln Kmn\right)}\!\!\diagup \!\!{}_{H}\;}n'\approx n-{}^{{\left(\ln Kmn\right)}}\!\!\diagup \!\!{}_{{H}}\;
      where {\displaystyle \mathrm {H} }{\mathrm {H}} is the average expected score per aligned pair of residues in an alignment of two random sequences. Altschul and Gish gave the typical values, {\displaystyle \lambda =0.318}\lambda =0.318, {\displaystyle \mathrm {K} =0.13}{\mathrm {K}}=0.13, and {\displaystyle \mathrm {H} =0.40}{\mathrm {H}}=0.40, for un-gapped local alignment using BLOSUM62 as the substitution matrix. Using the typical values for assessing the significance is called the lookup table method; it is not accurate. The expect score E of a database match is the number of times that an unrelated database sequence would obtain a score S higher than x by chance. The expectation E obtained in a search for a database of D sequences is given by

      {\displaystyle E\approx 1-e^{-p\left(s>x\right)D}}E\approx 1-e^{{-p\left(s>x\right)D}}
      Furthermore, when {\displaystyle p<0.1}p<0.1, E could be approximated by the Poisson distribution as

      {\displaystyle E\approx pD}E\approx pD
      This expectation or expect value “E” (often called an E score or E-value or e-value) assessing the significance of the HSP score for un-gapped local alignment is reported in the BLAST results. The calculation shown here is modified if individual HSPs are combined, such as when producing gapped alignments (described below), due to the variation of the statistical parameters.
    10. Make two or more HSP regions into a longer alignment.
      Sometimes, we find two or more HSP regions in one database sequence that can be made into a longer alignment. This provides additional evidence of the relation between the query and database sequence. There are two methods, the Poisson method and the sum-of-scores method, to compare the significance of the newly combined HSP regions. Suppose that there are two combined HSP regions with the pairs of scores (65, 40) and (52, 45), respectively. The Poisson method gives more significance to the set with the maximal lower score (45>40). However, the sum-of-scores method prefers the first set, because 65+40 (105) is greater than 52+45(97). The original BLAST uses the Poisson method; gapped BLAST and the WU-BLAST uses the sum-of scores method.
    11. Show the gapped Smith-Waterman local alignments of the query and each of the matched database sequences.
      • The original BLAST only generates un-gapped alignments including the initially found HSPs individually, even when there is more than one HSP found in one database sequence.
      • BLAST2 produces a single alignment with gaps that can include all of the initially found HSP regions. Note that the computation of the score and its corresponding E-value involves use of adequate gap penalties.
    12. Report every match whose expect score is lower than a threshold parameter E.

Blast Input/Output








Input Sequence in FASTA or Genbank format and weight matrix. BLAST output can be delivered in a variety of formats. These formats include HTML, plain text, and XML formatting. For NCBI’s web-page, the default format for output is HTML. When performing a BLAST on NCBI, the results are given in a graphical format showing the hits found, a table showing sequence identifiers for the hits with scoring related data, as well as alignments for the sequence of interest and the hits received with corresponding BLAST scores for these. The easiest to read and most informative of these is probably the table.






Uses of BLAST

BLAST can be used for several purposes. These include identifying species, locating domains, establishing phylogeny, DNA mapping, and comparison.

Identifying species

With the use of BLAST, you can possibly correctly identify a species or find homologous species. This can be useful,     for example, when you are working with a DNA sequence from an unknown species.

Locating domains
When working with a protein sequence you can input it into BLAST, to locate known domains within the sequence of interest.
Establishing phylogeny
Using the results received through BLAST you can create a phylogenetic tree using the BLAST web-page. Phylogenies based on BLAST alone are less reliable than other purpose-built computational phylogenetic methods, so should only be relied upon for “first pass” phylogenetic analyses.
DNA mapping
When working with a known species, and looking to sequence a gene at an unknown location, BLAST can compare the chromosomal position of the sequence of interest, to relevant sequences in the database(s).
When working with genes, BLAST can locate common genes in two related species, and can be used to map annotations from one organism to another.