by vineetg 3979 days ago
Thanks for bringing up all of these alternatives. We definitely would have preferred using an existing solution over building our own.

Unfortunately, a lot of the existing software is not intended for the search we're trying to do, or does not perform well under these conditions. We did in fact experiment with some of them before building our own. Bowtie, for example, doesn't allow more than 3 mismatches, and is also intended for alignments where there are very few matches (close to 1).

Since we need to be able to support multiple genomes (see Josh's comment), the amount of RAM we need to run a particular set of alignments is relatively important. Things like BLAT (which also seems intended for > 25 bases) need to keep the entire genome index in memory. This means that we would need to spin up a lot more servers to handle parallel requests, especially with different genomes.

FWIW our search is only a couple hundred lines of C++, and it runs with very low memory requirements.
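
To give a rough idea of what an index-free search can look like, here is a minimal sketch. It is not our actual code. The names `find_hits` and `within_mismatches` are made up for illustration, and it assumes substitutions only (no indels) and a sequence that is already in a buffer. It slides the query along the sequence and counts mismatches, bailing out of a window as soon as the limit is exceeded.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: true if the window of `genome` starting at `pos`
// differs from `query` in at most `max_mismatches` positions.
// Stops comparing as soon as the limit is exceeded.
static bool within_mismatches(const std::string& genome, std::size_t pos,
                              const std::string& query, int max_mismatches) {
    int mismatches = 0;
    for (std::size_t i = 0; i < query.size(); ++i) {
        if (genome[pos + i] != query[i] && ++mismatches > max_mismatches) {
            return false;
        }
    }
    return true;
}

// Hypothetical entry point: returns every start offset where `query`
// matches `genome` with at most `max_mismatches` substitutions.
// No index is built; the only extra memory is the list of hits.
std::vector<std::size_t> find_hits(const std::string& genome,
                                   const std::string& query,
                                   int max_mismatches) {
    std::vector<std::size_t> hits;
    if (query.empty() || genome.size() < query.size()) {
        return hits;
    }
    for (std::size_t pos = 0; pos + query.size() <= genome.size(); ++pos) {
        if (within_mismatches(genome, pos, query, max_mismatches)) {
            hits.push_back(pos);
        }
    }
    return hits;
}
```

For example, `find_hits("ACGTACGTAC", "ACGA", 1)` reports offsets 0 and 4. A scan like this trades CPU time for memory. To keep only a small window resident, the sequence could be read from disk in chunks rather than loaded whole.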

1 comments

Fair points about both the memory requirement and Bowtie - also, 20 BP is really small, so while BLAT seems to find 20 BP hits correctly, it doesn't find 20 BP hits with mismatches. Why be so memory stringent? This is genomics, after all... If you're averse to rolling your own, I'd look into BWA-MEM as a solution (which has more adjustable mismatch scoring than Bowtie), or figure out how Primer-BLAST seems to do specificity checks for small sequences so well.
We care about reducing memory usage because it gives us greater flexibility in our infrastructure. We're offering this to hundreds of scientists a day - not huge numbers in the grand scheme of things, but usage is bursty and by paying customers. This means we can mix in CRISPR work with our other processing infrastructure, and we can choose between many servers with low memory vs a few servers with high memory.

Maybe genomics doesn't have to be so memory hungry. 8)