|
|
|
|
|
by ggreer
4419 days ago
|
|
pcre_exec()'s length and offset parameters are ints, so there's not much I can do about files over 2GB. I really don't want to split the file into chunks and deal with matches across boundaries. That's just asking for bugs. I guess I could make literal string searches work, at least on 64 bit platforms. Honestly though, I don't think ag is the right tool for that job. For a single huge file, grep is going to be the same speed. Possibly faster, since grep's strstr() has been optimized for longer than I've been alive. |
|
DNA files don't change very often, which makes building an index worthwhile. Apparently, sequencing isn't perfect and neither are cells, so you'd want fuzzy matching. But repeats in DNA are also common, so that means fuzzy regex matching. There is already a fuzzy regex library[1], but I have no idea how fast it is. If the application requires performance above everything, an n-gram index sounds like the right tool for the job.
After writing the paragraph above, I searched for "DNA n-gram search." The original n-gram paper from 2006 used DNA sequences in their test corpus.[2] I don't know much about DNA or the applications built around it, so I'm glad I managed to recommend a tool that was designed for the job.
1. https://github.com/laurikari/tre/ (used by agrep)
2. Fast nGram-Based String Search Over Data Encoded Using Algebraic Signatures http://cedric.cnam.fr/~rigaux/papers/LMRS07.pdf