by keithwhor
4135 days ago
A position weight matrix would certainly work, but it can't be implemented to run as quickly. If a nucleotide could be represented as 0.6A and 0.4T, for example, representing it as "W" is nearly as accurate and can be compared using bit operations far more quickly. Think about a length-20 sequence: if only 5 in 100 identified high-affinity 20-mers contained a "C" at position 0, and the others all contained a "G", do I really care about the weight matching, or can I approximate that position as a "G" and still get roughly the same results? (Though it would be interesting to apply a PWM to the top [x] results of this algorithm, once completed, to specify exact rank.)
|
It really depends on how degenerate your motif is...
However, once you start adding in the probabilities, I think it might be better to do the proper calculation across the board.
Without knowing what you're actually looking for, it's hard to pinpoint what the optimal algorithm should be. The 4-bit optimization is a common choice for ambiguous sequences, so that might be a good place to start (and as a bonus, the revcomp can be a simple bitstring reversal if done correctly). But I have my doubts. Hell, given what you've said you're trying to do, a well-formed regex might even work just as well. :)
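Both closing suggestions can be sketched in a few lines. This assumes the bit order A=1, C=2, G=4, T=8, which is what makes "done correctly" work: reversing the four bits of a code yields its complement, so reversing the entire packed bitstring reverses and complements in one step. The helper names (`pack`, `unpack`, `revcomp`, `regex_hits`) are hypothetical.

```python
import re

BITS = {"A": 1, "C": 2, "G": 4, "T": 8}  # assumed bit order
IUPAC = {**BITS, "R": 5, "Y": 10, "S": 6, "W": 9, "K": 12, "M": 3,
         "B": 14, "D": 13, "H": 11, "V": 7, "N": 15}
DECODE = {bits: code for code, bits in IUPAC.items()}

def pack(seq):
    """Pack a sequence into one integer, 4 bits per position."""
    value = 0
    for ch in seq:
        value = (value << 4) | IUPAC[ch]
    return value

def unpack(packed, n):
    """Decode an n-position packed integer back into IUPAC letters."""
    return "".join(DECODE[(packed >> (4 * (n - 1 - i))) & 0xF] for i in range(n))

def revcomp(packed, n):
    """Reverse complement by reversing the whole 4n-bit string."""
    bits = format(packed, "b").zfill(4 * n)
    return int(bits[::-1], 2)

def regex_hits(motif, seq):
    """Same kind of search with a regex: each code becomes a character class."""
    classes = []
    for code in motif:
        letters = "".join(b for b, bit in BITS.items() if IUPAC[code] & bit)
        classes.append(f"[{letters}]")
    # Lookahead so that overlapping hits are reported too.
    pattern = re.compile("(?=(" + "".join(classes) + "))")
    return [m.start() for m in pattern.finditer(seq)]

print(unpack(revcomp(pack("AWR"), 3), 3))
print(regex_hits("WG", "AGTGCG"))
```

The regex version only handles plain A/C/G/T in the searched sequence, which is probably fine for a genome scan but is an assumption worth checking against the real input.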