Hacker News new | ask | show | jobs
by mbreese 4131 days ago
If you're going to go that route, you might as well just do a hashed aligner with a bitmasked searching. Then after the face, you can do the PWM calculation. This is how the BFAST aligner works...

It really depends on how degenerate your motif really is...

However, once you start adding in the probabilities, I think that it might be better to do the proper calculation across the board.

Without knowing what you're actually looking for, it's hard to pinpoint what the optimal algorithm should be. The 4-bit optimization is a common choice for ambiguous sequences, so that might be a good place to start (and as a bonus, the revcomp can be a simple bitstring reversal if done correctly). But I have my doubts. Hell, given what you've said you're trying to do, a well formed regex might even work just as well. :)

1 comments

While regex could replace the conditional check in the "initial" (string-only) implementation to allow for degenerate nucleotide matching, it would certainly not speed it up there.

Additionally, I'm not sure how you'd intend to do the exhaustive alignment checks / scoring with a regex. (I want to find all sequences that match my 20-mer with an identity score of at least 14/20 --- what's my well-formed regex for that? There are 38,760 different ways to choose 14 of 20 nucleotides.)

Plus, there's the added overhead of having to come up with a well-formed regex to begin with. :)