Hacker News new | ask | show | jobs
by mbreese 4131 days ago
If you're looking for a DNA binding sequence, you might be using the wrong algorithm here... Have you thought of using a position weight matrix (PWM)[1]? That's the way that I've always searched for binding sites, since that's how motifs are usually described. You can use these to still match ambiguous sequences, but favor some bases more than others.

[1] http://en.wikipedia.org/wiki/Position_weight_matrix

1 comments

A position weight matrix would certainly work, but can't be implemented to run as quickly. If a nucleotide could be represented as 0.6A and 0.4T, for example, representing it as "W" is nearly as accurate and can be compared using bit operations far more quickly.

Think about a length 20 sequence, if only 5 in 100 identified high-affinity 20-mers contained a "C" at position 0, and the others all contained a "G", do I really care about the weight matching, or can I approximate that position as a "G" and still get roughly the same results?

(Though it would be interesting to apply a PWM to the top [x] results of this algorithm, once completed, to specify exact rank.)

If you're going to go that route, you might as well just do a hashed aligner with a bitmasked searching. Then after the face, you can do the PWM calculation. This is how the BFAST aligner works...

It really depends on how degenerate your motif really is...

However, once you start adding in the probabilities, I think that it might be better to do the proper calculation across the board.

Without knowing what you're actually looking for, it's hard to pinpoint what the optimal algorithm should be. The 4-bit optimization is a common choice for ambiguous sequences, so that might be a good place to start (and as a bonus, the revcomp can be a simple bitstring reversal if done correctly). But I have my doubts. Hell, given what you've said you're trying to do, a well formed regex might even work just as well. :)

While regex could replace the conditional check in the "initial" (string-only) implementation to allow for degenerate nucleotide matching, it would certainly not speed it up there.

Additionally, I'm not sure how you'd intend to do the exhaustive alignment checks / scoring with a regex. (I want to find all sequences that match my 20-mer with an identity score of at least 14/20 --- what's my well-formed regex for that? There are 38,760 different ways to choose 14 of 20 nucleotides.)

Plus, there's the added overhead of having to come up with a well-formed regex to begin with. :)