| HN Mirror

BLAST is so deeply embedded in modern genome biology that it is hard to find short descriptions. Basically, there are two parts:

(1) local sequence (string) similarity scores (scores that allow mismatches, but need not extend to the end of either string) are distributed as the "extreme value distribution", which means that it is very easy to distinguish between similarity by chance and similarity because of common ancestry (I think of the names for the same entity as ancestral). This is a very powerful concept, that allows one to accurately set false-positive rates (it does not help with false negatives)

(2) BLAST calculates local similarity scores at O(n^2)/K where K is very very large by first identifying significant conserved "words" using a lookup table, then looking for runs of those words along an alignment diagonal, and then trying to extend that diagonal using an efficient banded dynamic programming algorithm.