|
|
|
|
|
by lcdoutlet
2151 days ago
|
|
I am a big fan of ripgrep. I think the tool is great for quick simple searches.
On larger data sets with sufficiently complicated regex patterns the performance gap is small. I have also encountered situations using rg on med-large datasets where the files may contain Unicode, regular expressions, random garbage... and rg failed. I don't recall a situation like this where the error handling in grep didn't recover and move on to the next file. Here is an example showing the performance gap to be small using (enwiki-20200401-pages-articles-multistream.xml 72G) split into 24 files on a ramdisk.
I ran this 50 times and the results are similar. du -sch .
72G .
72G total time ls|xargs -i -P24 rg -c "<.?>" {}
real 0m6.841s
user 2m8.766s
sys 0m5.637s time ls|xargs -i -P24 grep -Ec "<.?>" {}
real 0m6.168s
user 1m35.058s
sys 0m24.177s |
|
I would really love to have a bug report if you could find a way to reproduce this. ripgrep should certainly not fail regardless of the input. I've actually spent a lot of time trying to make sure that ripgrep's error handling and so forth generally match GNU grep's behavior.
> Here is an example showing the performance gap to be small using (enwiki-20200401-pages-articles-multistream.xml 72G) split into 24 files on a ramdisk. I ran this 50 times and the results are similar.
Could you show me where to get `enwiki-20200401-pages-articles-multistream.xml`? I'd love to run the benchmark myself, although I don't have a ramdisk quite that big. :P I could shrink the data set though a bit.
If I were to guess though, yeah, I'd generally expect GNU grep and ripgrep to perform similarly here. Speculating:
If the match count of those runs is very high, then total search time is likely dominated by per-match overhead. Both ripgrep and GNU grep should be pretty close there, but there's not too much room to improve drastically.
If the match count of those runs isn't high or is very small, then GNU grep and ripgrep will use a search algorithm that is virtually identical in its performance characteristics. They will both probably look for occurrences of `>` using `memchr`, and then run a full search on matching lines. It's hard to get much faster than that, so again, not a lot of room for differentiating yourself.
If I can get my hands on the corpus, then I'm pretty sure I could give you (non-pathological) patterns on which searches would show very different performance characteristics between GNU grep and ripgrep.