Hacker News new | ask | show | jobs
by pie 4419 days ago
It's also worth checking out The Silver Searcher: https://github.com/ggreer/the_silver_searcher
5 comments

Big problem with ag is that it appears to be broken on even moderately larger files where ack works just fine: https://github.com/ggreer/the_silver_searcher/issues/384
pcre_exec()'s length and offset parameters are ints, so there's not much I can do about files over 2GB. I really don't want to split the file into chunks and deal with matches across boundaries. That's just asking for bugs. I guess I could make literal string searches work, at least on 64 bit platforms.

Honestly though, I don't think ag is the right tool for that job. For a single huge file, grep is going to be the same speed. Possibly faster, since grep's strstr() has been optimized for longer than I've been alive.

I gave some thought to the right tool for the job of searching DNA.

DNA files don't change very often, which makes building an index worthwhile. Apparently, sequencing isn't perfect and neither are cells, so you'd want fuzzy matching. But repeats in DNA are also common, so that means fuzzy regex matching. There is already a fuzzy regex library[1], but I have no idea how fast it is. If the application requires performance above everything, an n-gram index sounds like the right tool for the job.

After writing the paragraph above, I searched for "DNA n-gram search." The original n-gram paper from 2006 used DNA sequences in their test corpus.[2] I don't know much about DNA or the applications built around it, so I'm glad I managed to recommend a tool that was designed for the job.

1. https://github.com/laurikari/tre/ (used by agrep)

2. Fast nGram-Based String Search Over Data Encoded Using Algebraic Signatures http://cedric.cnam.fr/~rigaux/papers/LMRS07.pdf

If ag knows it can't search the whole file, it should at least give a warning. Or why not use search_stream?

Silently skipping parts of it seems like the worst thing to do.

Good point about the warning. I'll add that. With regards to search_stream() in search.c... all I can say is that I'm sorry:

https://github.com/ggreer/the_silver_searcher/blob/master/sr...

I built ag for myself; both as a tool and to improve my skills profiling, benchmarking, and optimizing. Had I known how popular it would become, I would have definitely held myself to a higher standard, or any standard. Most importantly, I'd have written tests. These days, I'm busy with a startup so progress on those fronts has been slow.

For me ag has been magical. It does exactly what I want 99% of the time and is just blazingly fast.

So.. thanks :)

it's an awesome tool I use dozens of times a day, so thank you
ag is incredible, especially paired with Ack.vim and a mapping. I use <leader>as to search for the current word under the cursor. The results are instantaneous. With ag and YouCompleteMe, I never fall back to cscope/ctags in C++ projects anymore.

One thing though, it skips certain source files seemingly arbitrarily without the -t param and I haven't figured out why... Doesn't seem related to any .gitignore entries that I have been able to identify.

Good to know, and that makes sense to me. Thank you for adding a warning, as well.

ack turns out to be much faster than grep on these large files, FWIW.

Thanks for making this superb tool :)

The silver searcher is pretty good. but it has a couple of big problems. It does not parse the .gitignore correctly [0], so it frequently searches files that are not committed to your repo. This, combined with the decision to print 10000 character long lines mean a lot of search results are useless.

[0] https://github.com/ggreer/the_silver_searcher/issues/367 for example

I noticed the issue you mentioned, but as the last comment mentions, I believe this has already been fixed. My specific case at least was resolved by updating from master.
The author notes that git grep is faster than ag when you're lucky enough to be searching a repo.
I switched to this after Ack 2 removed all options to search through binary files.

ag is less picky, and the increased speed is a nice bonus.

Binary file search was my primary motivator as well. I really do love the functionality of the tool.
1000 times this... it is exactly like ack-grep but faster :)
One thing I miss a little is that ack has the super convenient:

    ack --java "foo"
while with ag you write:

    ag -G"\.java$" "foo"
But yes, ack and ag feel pretty identical except for the speed. Most of the time the speed improvement is irrelevant to me, except sometimes now I'll use ag in my home folder, and it's still fairly snappy.
That was too much typing anyway. When you mostly work with one language something like this is nice (in my case c/c++): alias ack-cpp='ack-grep --type=cpp --type=cc'
I have that aliased to 'cack'. Then ruby is 'rack', python is 'pack', go is 'gack', etc.

(I've never needed to use the rackup "rack" command directly, fortunately, if you do you ought to use a different alias)

> I've never needed to use the rackup "rack" command directly, fortunately, if you do you ought to use a different alias

Or escape alias \rack

or $(which rack)
ag has had this for a while now (I'm on version 0.21.0).
nice! i need to pull and recompile!
Judging by this: https://news.ycombinator.com/item?id=7710269 I guess ag now supports that, too.
Thanks for the ack tip! Anything else super useful come to mind?
ag is not "exactly like" ack. There are features that ack has that ag has chosen not to replicate. ack is also more portable.

That's not a knock on ag at all, and if ag fits your needs, then by all means use it.

I normally tell people to use ack because it's like grep but faster (owing to it's sensible defaults) ... if I use this I'm worried I might go too fast and travel backwards in time or something.
Give ag a shot, you'll be able to relive the emotions you felt when you switched to ack from grep, but this time you're switching to ag from ack.
How is it so fast? Files are mmap()ed instead of read into a buffer.

It's hard to believe this would give a significant performance boost. Is there evidence of this?

In my benchmarking, mmap() was about 20% faster than read() on OS X, but the same speed on Ubuntu. Pretty much everything else in the list (pthreads, JIT regex compiler, Boyer-Moore-Horspool strstr(), etc) improves performance more than mmap().

Also, mmap() has the disadvantage that it can segfault your process if something else makes the underlying file smaller. In fact, there have been kernel bugs related to separate processes mmapping and truncating the same file.[1] I mostly use mmap() because my primary computer is a mac.

A side note: Parts of OS X's kernel seem... not very optimized to say the least. See the bar graph at http://geoff.greer.fm/2012/09/07/the-silver-searcher-adding-... for an example.

1. http://lwn.net/Articles/357767/

There is a nice (and often posted) mailing list post explaining some of the reasons GNU Grep is so fast:

http://lists.freebsd.org/pipermail/freebsd-current/2010-Augu...

He mentions: "So even nowadays, using --mmap can be worth a >20% speedup."

Now I'm burning with curiosity. I have to know why! My plan:

- replicate the experiment, confirm --mmap shaves off a non-negligible amount of time. It could be that his computer happened to be running something in the background that was using his harddrive, for example, which would skew the results.

- look at the code, figure out the exact difference between what --mmap is doing and what it does by default. Confirm that the problem isn't in grep itself (it's probably not, but it's important to check).

- dig into the kernel source to figure out the difference under the hood and why it might be faster.

I wonder if it has to do with not having to copy data back and forth between kernel and userspace. My mildly uneducated thought is that you could do this with splice() or whatever, but mmap is an easy drop-in replacement.

edit: I've been reading your posts for a while and I like them, but I keep wondering, why do you have sillysaurus1-2-3?

That's what has me so curious, because it doesn't seem like copying between kernel/userspace should account for a 20% speed drop. Once data is in the L3 CPU cache, it should be inexpensive to move it around.

Regarding my ancestry, I'm sillysaurus3 because I've (rightfully) been in trouble twice with the mods for getting too personal on HN. I apologized and changed my behavior accordingly, and additionally created a new account both times to serve as a constant reminder to be objective and emotionless. There's rarely a reason to argue with a person rather than with an idea. Debating ideas, not people, has a bunch of nice benefits: it's easier to learn from your mistakes, it makes for better reading, etc. It's pretty important, because forgetting that principle leads to exchanges like https://news.ycombinator.com/item?id=7700145

Another nice benefit of creating a new account is that you lose your downvoting privilege for a time, which made me more thoughtful about whether a downvote is actually justified.

Possibly the OS is doing interesting things with file access and caching and opting out of that has benefits for this particular workload?

...

I just skimmed the bsd mailing list email on why grep is fast which was linked up-thread, and it seems that's somewhat the case. It sounds like since they are doing advanced search techniques on what matches or can match, they use mmap to avoid requiring the kernel copy every byte into memory, when they know they only need to look at specific ranges of bytes in some instances. At least that was the case at some point in the past.

Finally, when I was last the maintainer of GNU grep (15+ years ago...), GNU grep also tried very hard to set things up so that the _kernel_ could ALSO avoid handling every byte of the input, by using mmap() instead of read() for file input. At the time, using read() caused most Unix versions to do extra copying.

P.S. Nice attitude, it earned an upvote from me. Which is probably one reason why your third account has more karma than my first.

You can do this with large reads:

     read(fd, buf, 100 megs)
can, in the kernel, do something like:

     read(fd, buf[0:first-page-boundary])
     remap(fd, buf[first-page-boundary:last-page-boundary])
     read(fd, buf[last-page-boundary:end])
There you go, zero copy reads. Or at least, minimal copy reads -- at most you will get 2 pages worth of copying.
The last time I had to do fast, large sequential disk reads on Linux it was surprisingly complex to get all the buffering/caching/locking to not do the wrong thing and slow me down a lot. I wouldn't be surprised if non-optimized mmap() is a whole lot faster than non-optimized use of high level file i/o libraries.
> It's hard to believe this would give a significant performance boost.

Why is that so hard to believe? It's a standard optimization—the kernel can almost certainly coordinate reading better than your userspace C can.

So are we talking about constant-time optimization, then? I.e. it shaves off a few milliseconds regardless of how complex the search is, or how many files it's reading, or how large each file is. I'll happily concede that mmap() might do that. But a performance boost linear w.r.t. search complexity/number of files/filesize? Hard to believe, and I should go measure it myself to prove the point or learn why I'm mistaken.
Likely a constant time improvement... for each file being searched.

I don't think that anybody is claiming that mmapping actually changes the algorithmic complexity of the actual search operations.

Constant-time improvements are still improvements, especially if they're in an inner loop. Otherwise we would all be using Python and just writing great algorithms.