by mbreese
4135 days ago
BLAT doesn't use a 4-bit format for ambiguous bases, but it's a common enough optimization. I've done it before in private aligners... and I'm sure I'm not the first. The problem with a 4-bit format is that ambiguous bases are so rare in a full reference that it's more efficient to handle them in a special code path than to waste 2 extra bits on every base. The 2-bit format stores only A, C, G, and T, with runs of 'N's recorded separately (since N's are mostly contiguous in reference genomes). http://genome-source.cse.ucsc.edu/gitweb/?p=kent.git;a=blob_...
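To make the 2-bit idea concrete, here is a minimal sketch of packing four bases per byte while keeping N runs in a side list. It is not the kent source code: the names `pack_2bit` and `unpack_2bit` and the return layout are hypothetical, and the sketch assumes the input contains only A, C, G, T, and N. The code values (T=0, C=1, A=2, G=3) and the high-bits-first packing order follow the .2bit convention as I understand it.

```python
# Illustrative sketch only: 2 bits per base, N runs stored separately.
# Function names and the (bytes, runs, length) layout are hypothetical.
CODES = {"T": 0, "C": 1, "A": 2, "G": 3}
BASES = "TCAG"  # index in this string == 2-bit code


def pack_2bit(seq):
    """Return (packed bytes, list of (start, length) N runs, sequence length)."""
    seq = seq.upper()
    packed = bytearray((len(seq) + 3) // 4)  # four bases per byte
    n_blocks = []
    run_start = None
    for i, base in enumerate(seq):
        if base == "N":
            if run_start is None:
                run_start = i  # a new run of N's begins here
            code = 0  # placeholder bits; the N run list is authoritative
        else:
            if run_start is not None:
                n_blocks.append((run_start, i - run_start))
                run_start = None
            code = CODES[base]  # raises KeyError on any other symbol
        # First base of each byte goes in the two most significant bits.
        packed[i // 4] |= code << (6 - 2 * (i % 4))
    if run_start is not None:
        n_blocks.append((run_start, len(seq) - run_start))
    return bytes(packed), n_blocks, len(seq)


def unpack_2bit(packed, n_blocks, length):
    """Rebuild the sequence, then overwrite the recorded N runs."""
    out = [BASES[(packed[i // 4] >> (6 - 2 * (i % 4))) & 3] for i in range(length)]
    for start, size in n_blocks:
        out[start:start + size] = "N" * size
    return "".join(out)
```

A 12-base sequence with one run of four N's takes 3 bytes plus a single `(start, length)` pair, versus 6 bytes at 4 bits per base. The saving only holds while N runs stay few and long, which is the assumption about reference genomes made above.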
I started developing this in the first place because I found myself banging my head against a wall trying to use BLAST to identify sequences that matched a rather loosely defined DNA-binding consensus sequence. The number of concrete sequences you have to search for grows exponentially with the number of degenerate nucleotides in the query.
It wasn't the reference I was concerned with; it was the sequence I was searching for. :)
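For anyone wondering why degenerate queries blow up, and how a 4-bit encoding sidesteps it, here is a rough sketch. It is not necessarily how the tool discussed here works: the helper names (`encode`, `expansions`, `find_matches`) are made up for illustration, and the scan is a naive sliding window rather than an indexed search. Each IUPAC code becomes a 4-bit mask with one bit per base, so two symbols are compatible exactly when their masks share a bit.

```python
# Illustrative sketch: IUPAC nucleotide codes as 4-bit masks.
# Bit assignment (A=1, C=2, G=4, T=8) is an arbitrary choice for this example.
A, C, G, T = 1, 2, 4, 8
IUPAC = {
    "A": A, "C": C, "G": G, "T": T,
    "R": A | G, "Y": C | T, "S": C | G, "W": A | T,
    "K": G | T, "M": A | C,
    "B": C | G | T, "D": A | G | T, "H": A | C | T, "V": A | C | G,
    "N": A | C | G | T,
}


def encode(seq):
    """Translate a sequence into a list of 4-bit masks."""
    return [IUPAC[base] for base in seq.upper()]


def expansions(query):
    """Count the concrete ACGT sequences a degenerate query stands for."""
    total = 1
    for mask in encode(query):
        total *= bin(mask).count("1")  # number of bases this symbol allows
    return total


def find_matches(query, reference):
    """Return start offsets where every query symbol overlaps the reference symbol."""
    q = encode(query)
    r = encode(reference)
    hits = []
    for i in range(len(r) - len(q) + 1):
        # zip stops at the end of q, so only the current window is compared
        if all(qm & rm for qm, rm in zip(q, r[i:])):
            hits.append(i)
    return hits
```

With this, `expansions("RRNNYY")` is 256, so a six-base query already stands for 256 exact sequences, while `find_matches("GANTC", "TTGAATCCGACTCA")` scans the reference once and returns `[2, 8]`. Note that under this overlap rule an N in the reference matches any query symbol, which may or may not be what you want.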