Hacker News new | ask | show | jobs
by discardorama 4141 days ago
I liked the exposition, but it would have been nice if the author had compared his implementation with something more meaty. Like Sustik-Moore[1], for instance.

[1] http://www.cs.utexas.edu/users/moore/publications/sustik-moo...

2 comments

It would have been good to compare the JS implementation to a C implementation. Given the algorithm, it shouldn't be too bad. I'd suspect that the C version would work much better, but I could be surprised.
Funny anecdote, my first implementation of the algorithm was actually in Go (for concurrency sake). It was an earlier version and wasn't as optimized as the one available here, but I found it actually performed 10% slower at matching nucleotides as compared the same algorithm running in V8. At the time, I attributed it to V8 "writing" better-optimized instructions than I possibly could.
C is one of the few languages that might be more error-prone than JS. I'd be more interested to see a comparison to a safer high-performance language, e.g. Haskell.
Author here. :)

The article wasn't meant to be highly comprehensive, but if my method is indeed new, I'd be more than happy spending more time writing an article that's a bit more technical. (To note, storing and comparing nucleotide sequences as binary strings isn't novel in and of itself. I haven't found evidence of the method of comparison I've used, however.)

The method you are using for comparison is commonly simdized and used for string matching.

Note also that intel/AMD SSE4+ has a 32 bit/64 bit popcnt instruction with 3 cycle latency/1 cycle throughput (for both 32 bit and 64 bit version), and so is faster for counting bits/matches than any of the methods you are using :)

Nice! Would love to see it implemented - though that popcnt instruction isn't accessible via JavaScript as far as I'm aware. ;)
Also note - your while loop isn't 0 cycles, in fact, it's probably worse than the bit-twiddling depending on sparseness.

This is because the test and branch always happens, and is essentially guaranteed to be mispredicted a lot.

These branch mispredictions are likely to cost a lot more than the n cycles it theoretically saves.

Doesn't BLAT do that? (2bit format)
BLAT doesn't use a 4-bit format for ambiguous bases, but it's a common enough optimization. I've done it before in private aligners... and I'm sure I'm not the first.

The problem with 4-bit format is that ambiguous bases are so rare in a full reference that it's more efficient to handle the ambiguous bases in a special code path, rather than waste the 2 extra bits per base.

The 2-bit format stores only ACGT, with regions of 'N's stored separately (since N's are mainly contiguous in the reference genomes).

http://genome-source.cse.ucsc.edu/gitweb/?p=kent.git;a=blob_...

You're right about the rarity of ambiguous bases in reference sequences, but potential search sequences are important as well.

The reason I started development on this to begin with was because I found myself banging my head against a wall when trying to use BLAST to identify sequences that matched a rather loosely-defined DNA-binding consensus sequence. The potential combinations of nucleotides multiply exponentially when you have degenerate nucleotides in your search query.

It wasn't the reference I was concerned with, it was the sequence I was searching for. :)

If you're looking for a DNA binding sequence, you might be using the wrong algorithm here... Have you thought of using a position weight matrix (PWM)[1]? That's the way that I've always searched for binding sites, since that's how motifs are usually described. You can use these to still match ambiguous sequences, but favor some bases more than others.

[1] http://en.wikipedia.org/wiki/Position_weight_matrix

A position weight matrix would certainly work, but can't be implemented to run as quickly. If a nucleotide could be represented as 0.6A and 0.4T, for example, representing it as "W" is nearly as accurate and can be compared using bit operations far more quickly.

Think about a length 20 sequence, if only 5 in 100 identified high-affinity 20-mers contained a "C" at position 0, and the others all contained a "G", do I really care about the weight matching, or can I approximate that position as a "G" and still get roughly the same results?

(Though it would be interesting to apply a PWM to the top [x] results of this algorithm, once completed, to specify exact rank.)

What proportion of bases in your query were ambiguous? If it was a fairly low percentage (~10-20%?), I think you could probably get away with using LASTZ, treating all ambiguous bases as completely ambiguous (--ambiguous=iupac), then write a short script to go through the alignments afterward and filter out ones that don't have a reference base that matches the IUPAC character in the query.

But that's the lazy approach :) Good article.

Thanks. :)

Sometimes reinventing a wheel (or at the very least a spoke or two) can be fun and provide some neat results.

Degenerate nucleotide matching (using this method) is only possible with 4 bits of data.

If BLAT is using this method for counting matches, I'm unaware. The algorithm is reliant on storing nucleotides in 4 bits, so I doubt it uses this exact method.

Actually, given huffman coding, you can do it with fewer than 4 because the degenerates are so rare.
Kills the algorithm (as described here) performance when you have nonstandard nucleotide widths (in bits). Rather, makes it completely unsuitable. It's dependent upon fixed-width binary data.