| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by ggreer 4467 days ago
	pcre_exec()'s length and offset parameters are ints, so there's not much I can do about files over 2GB. I really don't want to split the file into chunks and deal with matches across boundaries. That's just asking for bugs. I guess I could make literal string searches work, at least on 64 bit platforms. Honestly though, I don't think ag is the right tool for that job. For a single huge file, grep is going to be the same speed. Possibly faster, since grep's strstr() has been optimized for longer than I've been alive.

3 comments

ggreer 4466 days ago

I gave some thought to the right tool for the job of searching DNA.

DNA files don't change very often, which makes building an index worthwhile. Apparently, sequencing isn't perfect and neither are cells, so you'd want fuzzy matching. But repeats in DNA are also common, so that means fuzzy regex matching. There is already a fuzzy regex library[1], but I have no idea how fast it is. If the application requires performance above everything, an n-gram index sounds like the right tool for the job.

After writing the paragraph above, I searched for "DNA n-gram search." The original n-gram paper from 2006 used DNA sequences in their test corpus.[2] I don't know much about DNA or the applications built around it, so I'm glad I managed to recommend a tool that was designed for the job.

1. https://github.com/laurikari/tre/ (used by agrep)

2. Fast nGram-Based String Search Over Data Encoded Using Algebraic Signatures http://cedric.cnam.fr/~rigaux/papers/LMRS07.pdf

link

tveita 4467 days ago

If ag knows it can't search the whole file, it should at least give a warning. Or why not use search_stream?

Silently skipping parts of it seems like the worst thing to do.

link

ggreer 4467 days ago

Good point about the warning. I'll add that. With regards to search_stream() in search.c... all I can say is that I'm sorry:

https://github.com/ggreer/the_silver_searcher/blob/master/sr...

I built ag for myself; both as a tool and to improve my skills profiling, benchmarking, and optimizing. Had I known how popular it would become, I would have definitely held myself to a higher standard, or any standard. Most importantly, I'd have written tests. These days, I'm busy with a startup so progress on those fronts has been slow.

link

masterj 4467 days ago

For me ag has been magical. It does exactly what I want 99% of the time and is just blazingly fast.

So.. thanks :)

link

x0x0 4467 days ago

it's an awesome tool I use dozens of times a day, so thank you

link

mpercy 4466 days ago

ag is incredible, especially paired with Ack.vim and a mapping. I use <leader>as to search for the current word under the cursor. The results are instantaneous. With ag and YouCompleteMe, I never fall back to cscope/ctags in C++ projects anymore.

One thing though, it skips certain source files seemingly arbitrarily without the -t param and I haven't figured out why... Doesn't seem related to any .gitignore entries that I have been able to identify.

link

ihodes 4466 days ago

Good to know, and that makes sense to me. Thank you for adding a warning, as well.

ack turns out to be much faster than grep on these large files, FWIW.

Thanks for making this superb tool :)

link