pcre_exec()'s length and offset parameters are ints, so there's not much I can do about files over 2GB. I really don't want to split the file into chunks and deal with matches across boundaries. That's just asking for bugs. I guess I could make literal string searches work, at least on 64-bit platforms.
Honestly though, I don't think ag is the right tool for that job. For a single huge file, grep is going to be the same speed. Possibly faster, since grep's strstr() has been optimized for longer than I've been alive.
I gave some thought to the right tool for the job of searching DNA.
DNA files don't change very often, which makes building an index worthwhile. Apparently, sequencing isn't perfect and neither are cells, so you'd want fuzzy matching. But repeats in DNA are also common, so that means fuzzy regex matching. There is already a fuzzy regex library[1], but I have no idea how fast it is. If the application requires performance above everything, an n-gram index sounds like the right tool for the job.
After writing the paragraph above, I searched for "DNA n-gram search." The original n-gram paper from 2006 used DNA sequences in their test corpus.[2] I don't know much about DNA or the applications built around it, so I'm glad I managed to recommend a tool that was designed for the job.
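To make the n-gram index idea a bit more concrete, here is a minimal sketch in Python. It is not taken from the paper or from any existing library: the function names (build_index, search) and the choice of n=3 are assumptions made purely for illustration, and it only handles exact matches. Fuzzy matching would need extra work on top of this.

```python
from collections import defaultdict


def build_index(text, n=3):
    """Map each n-gram to the ascending list of offsets where it starts."""
    index = defaultdict(list)
    for i in range(len(text) - n + 1):
        index[text[i:i + n]].append(i)
    return index


def search(text, index, pattern, n=3):
    """Return the offsets of every exact occurrence of pattern in text.

    Hypothetical helper: pattern must be at least n characters long.
    """
    if len(pattern) < n:
        raise ValueError("pattern is shorter than the n-gram size")

    # Pick the n-gram of the pattern with the fewest postings, so that
    # as few candidate positions as possible need to be verified.
    best_off, best_posts = 0, None
    for off in range(len(pattern) - n + 1):
        posts = index.get(pattern[off:off + n], [])
        if best_posts is None or len(posts) < len(best_posts):
            best_off, best_posts = off, posts

    # Verify each candidate against the full text.
    hits = []
    for p in best_posts:
        start = p - best_off
        if start >= 0 and text.startswith(pattern, start):
            hits.append(start)
    return hits


if __name__ == "__main__":
    genome = "ACGTACGTTACG"
    idx = build_index(genome)
    print(search(genome, idx, "CGTT"))
```

The index is built once and reused for every query, which is why it only pays off when the file rarely changes. Each lookup touches only the postings for one n-gram rather than scanning the whole sequence.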
I built ag for myself, both as a tool and to improve my skills at profiling, benchmarking, and optimizing. Had I known how popular it would become, I would have definitely held myself to a higher standard, or any standard. Most importantly, I'd have written tests. These days, I'm busy with a startup, so progress on those fronts has been slow.
ag is incredible, especially paired with Ack.vim and a mapping. I use <leader>as to search for the current word under the cursor. The results are instantaneous. With ag and YouCompleteMe, I never fall back to cscope/ctags in C++ projects anymore.
One thing, though: it skips certain source files seemingly arbitrarily without the -t param, and I haven't figured out why. It doesn't seem related to any .gitignore entries that I have been able to identify.