Hacker News new | ask | show | jobs
by keithwhor 4131 days ago
Author here. :)

The article wasn't meant to be highly comprehensive, but if my method is indeed new, I'd be more than happy spending more time writing an article that's a bit more technical. (To note, storing and comparing nucleotide sequences as binary strings isn't novel in and of itself. I haven't found evidence of the method of comparison I've used, however.)

2 comments

The method you are using for comparison is commonly simdized and used for string matching.

Note also that intel/AMD SSE4+ has a 32 bit/64 bit popcnt instruction with 3 cycle latency/1 cycle throughput (for both 32 bit and 64 bit version), and so is faster for counting bits/matches than any of the methods you are using :)

Nice! Would love to see it implemented - though that popcnt instruction isn't accessible via JavaScript as far as I'm aware. ;)
Also note - your while loop isn't 0 cycles, in fact, it's probably worse than the bit-twiddling depending on sparseness.

This is because the test and branch always happens, and is essentially guaranteed to be mispredicted a lot.

These branch mispredictions are likely to cost a lot more than the n cycles it theoretically saves.

Doesn't BLAT do that? (2bit format)
BLAT doesn't use a 4-bit format for ambiguous bases, but it's a common enough optimization. I've done it before in private aligners... and I'm sure I'm not the first.

The problem with 4-bit format is that ambiguous bases are so rare in a full reference that it's more efficient to handle the ambiguous bases in a special code path, rather than waste the 2 extra bits per base.

The 2-bit format stores only ACGT, with regions of 'N's stored separately (since N's are mainly contiguous in the reference genomes).

http://genome-source.cse.ucsc.edu/gitweb/?p=kent.git;a=blob_...

You're right about the rarity of ambiguous bases in reference sequences, but potential search sequences are important as well.

The reason I started development on this to begin with was because I found myself banging my head against a wall when trying to use BLAST to identify sequences that matched a rather loosely-defined DNA-binding consensus sequence. The potential combinations of nucleotides multiply exponentially when you have degenerate nucleotides in your search query.

It wasn't the reference I was concerned with, it was the sequence I was searching for. :)

If you're looking for a DNA binding sequence, you might be using the wrong algorithm here... Have you thought of using a position weight matrix (PWM)[1]? That's the way that I've always searched for binding sites, since that's how motifs are usually described. You can use these to still match ambiguous sequences, but favor some bases more than others.

[1] http://en.wikipedia.org/wiki/Position_weight_matrix

A position weight matrix would certainly work, but can't be implemented to run as quickly. If a nucleotide could be represented as 0.6A and 0.4T, for example, representing it as "W" is nearly as accurate and can be compared using bit operations far more quickly.

Think about a length 20 sequence, if only 5 in 100 identified high-affinity 20-mers contained a "C" at position 0, and the others all contained a "G", do I really care about the weight matching, or can I approximate that position as a "G" and still get roughly the same results?

(Though it would be interesting to apply a PWM to the top [x] results of this algorithm, once completed, to specify exact rank.)

If you're going to go that route, you might as well just do a hashed aligner with a bitmasked searching. Then after the face, you can do the PWM calculation. This is how the BFAST aligner works...

It really depends on how degenerate your motif really is...

However, once you start adding in the probabilities, I think that it might be better to do the proper calculation across the board.

Without knowing what you're actually looking for, it's hard to pinpoint what the optimal algorithm should be. The 4-bit optimization is a common choice for ambiguous sequences, so that might be a good place to start (and as a bonus, the revcomp can be a simple bitstring reversal if done correctly). But I have my doubts. Hell, given what you've said you're trying to do, a well formed regex might even work just as well. :)

What proportion of bases in your query were ambiguous? If it was a fairly low percentage (~10-20%?), I think you could probably get away with using LASTZ, treating all ambiguous bases as completely ambiguous (--ambiguous=iupac), then write a short script to go through the alignments afterward and filter out ones that don't have a reference base that matches the IUPAC character in the query.

But that's the lazy approach :) Good article.

Thanks. :)

Sometimes reinventing a wheel (or at the very least a spoke or two) can be fun and provide some neat results.

Degenerate nucleotide matching (using this method) is only possible with 4 bits of data.

If BLAT is using this method for counting matches, I'm unaware. The algorithm is reliant on storing nucleotides in 4 bits, so I doubt it uses this exact method.

Actually, given huffman coding, you can do it with fewer than 4 because the degenerates are so rare.
Kills the algorithm (as described here) performance when you have nonstandard nucleotide widths (in bits). Rather, makes it completely unsuitable. It's dependent upon fixed-width binary data.