by ggreer 4419 days ago
pcre_exec()'s length and offset parameters are ints, so there's not much I can do about files over 2GB. I really don't want to split the file into chunks and deal with matches across boundaries. That's just asking for bugs. I guess I could make literal string searches work, at least on 64-bit platforms.

Honestly though, I don't think ag is the right tool for that job. For a single huge file, grep is going to be the same speed. Possibly faster, since grep's strstr() has been optimized for longer than I've been alive.
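
To illustrate why the literal case is more tractable than the regex one: a literal needle has a fixed length, so carrying the last len(needle) - 1 bytes of each chunk into the next is enough to catch matches that straddle a boundary. Here's a rough Python sketch of that idea. It is not how ag does or would do it; find_literal and chunk_size are names made up for this example.

```python
import io

def find_literal(stream, needle, chunk_size=1 << 20):
    """Yield the absolute byte offset of every occurrence of needle.

    Hypothetical helper for illustration only; not part of ag.
    """
    if not needle:
        raise ValueError("needle must be non-empty")
    overlap = len(needle) - 1
    tail = b""   # bytes carried over from the previous chunk
    base = 0     # absolute offset of the first byte of `tail`
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        window = tail + chunk
        pos = window.find(needle)
        while pos != -1:
            yield base + pos
            pos = window.find(needle, pos + 1)
        # Keep just enough bytes to catch a match straddling the boundary.
        # The carried tail is shorter than the needle, so no match can be
        # reported twice.
        keep = min(overlap, len(window))
        tail = window[len(window) - keep:]
        base += len(window) - keep

data = b"xxACGTxxACGTACxx"
offsets = list(find_literal(io.BytesIO(data), b"ACGT", chunk_size=5))
```

Offsets here are plain Python integers, so the 2GB limit doesn't come up. A regex has no fixed match length, so no fixed overlap is guaranteed to be enough, which is where the boundary bugs would come from.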

I gave some thought to the right tool for the job of searching DNA.

DNA files don't change very often, which makes building an index worthwhile. Apparently, sequencing isn't perfect and neither are cells, so you'd want fuzzy matching. But repeats in DNA are also common, so that means fuzzy regex matching. There is already a fuzzy regex library[1], but I have no idea how fast it is. If the application requires performance above everything, an n-gram index sounds like the right tool for the job.

After writing the paragraph above, I searched for "DNA n-gram search." The original n-gram paper from 2006 used DNA sequences in their test corpus.[2] I don't know much about DNA or the applications built around it, so I'm glad I managed to recommend a tool that was designed for the job.
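
For anyone wondering what an n-gram index looks like in its simplest form, here's a toy Python sketch. It only does exact matching (no fuzzy matching, and nothing like the algebraic signatures from the paper), and build_index and search are names I made up for the example. The idea: record every position where each n-character substring occurs, then use the pattern's first n-gram to get candidate positions and verify each one.

```python
from collections import defaultdict

def build_index(seq, n=3):
    """Map each n-gram in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - n + 1):
        index[seq[i:i + n]].append(i)
    return index

def search(seq, index, pattern, n=3):
    """Return start positions of exact matches of pattern in seq."""
    if len(pattern) < n:
        # Pattern is shorter than one n-gram: the index can't help,
        # so fall back to a linear scan.
        return [i for i in range(len(seq) - len(pattern) + 1)
                if seq.startswith(pattern, i)]
    # Candidates are positions where the pattern's first n-gram occurs;
    # verify the full pattern at each one.
    candidates = index.get(pattern[:n], [])
    return [i for i in candidates if seq.startswith(pattern, i)]

seq = "ACGTACGTTACG"
idx = build_index(seq)
hits = search(seq, idx, "CGTT")
```

With a four-letter alphabet, short n-grams match all over the place, so a real index would presumably want a larger n than this toy uses.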

1. https://github.com/laurikari/tre/ (used by agrep)

2. Fast nGram-Based String Search Over Data Encoded Using Algebraic Signatures http://cedric.cnam.fr/~rigaux/papers/LMRS07.pdf

If ag knows it can't search the whole file, it should at least give a warning. Or why not use search_stream?

Silently skipping parts of it seems like the worst thing to do.

Good point about the warning. I'll add that. With regard to search_stream() in search.c... all I can say is that I'm sorry:

https://github.com/ggreer/the_silver_searcher/blob/master/sr...

I built ag for myself, both as a tool and to improve my skills at profiling, benchmarking, and optimizing. Had I known how popular it would become, I would have definitely held myself to a higher standard, or any standard. Most importantly, I'd have written tests. These days, I'm busy with a startup, so progress on those fronts has been slow.

For me ag has been magical. It does exactly what I want 99% of the time and is just blazingly fast.

So.. thanks :)

it's an awesome tool I use dozens of times a day, so thank you

ag is incredible, especially paired with Ack.vim and a mapping. I use <leader>as to search for the current word under the cursor. The results are instantaneous. With ag and YouCompleteMe, I never fall back to cscope/ctags in C++ projects anymore.

One thing though, it skips certain source files seemingly arbitrarily without the -t param and I haven't figured out why... Doesn't seem related to any .gitignore entries that I have been able to identify.

Good to know, and that makes sense to me. Thank you for adding a warning, as well.

ack turns out to be much faster than grep on these large files, FWIW.

Thanks for making this superb tool :)