by burntsushi 1113 days ago

For GNU grep in particular, no, using a UTF-8 locale can significantly slow it down:

    $ time LC_ALL=C grep -E '^\w{30}$' OpenSubtitles2018.raw.sample.en -c
    3
    
    real    0.808
    user    0.744
    sys     0.063
    maxmem  10 MB
    faults  0
    
    $ time LC_ALL=en_US.UTF-8 grep -E '^\w{30}$' OpenSubtitles2018.raw.sample.en -c
    4
    
    real    20.064
    user    19.982
    sys     0.077
    maxmem  10 MB
    faults  0

Whereas ripgrep is Unicode-aware by default, and still about as fast as the ASCII-only variant of GNU grep above:

    $ time rg '^\w{30}$' OpenSubtitles2018.raw.sample.en -c
    4
    
    real    1.163
    user    1.132
    sys     0.030
    maxmem  916 MB
    faults  0

1 comment

kps 1113 days ago

For grep, how much of the difference is due to '\w' having a different meaning between the two cases?


burntsushi 1112 days ago

That's exactly the point. ripgrep uses the Unicode definition by default and so corresponds to what GNU grep is doing in the en_US.UTF-8 locale.
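To make the difference in `\w` concrete, here is a rough sketch using Python's `re` module rather than grep or ripgrep themselves. It is only an analogy: `re.ASCII` stands in for the `LC_ALL=C` case and the default `str` matching stands in for the Unicode case. The sample lines are made up for illustration and are not taken from the OpenSubtitles file above.

```python
import re

# Hypothetical sample lines, invented for this illustration.
lines = [
    "a" * 30,                    # 30 ASCII letters
    "\u00e9" * 30,               # 30 copies of U+00E9 (e with acute accent)
    "a" * 29,                    # one character too short
    "a" * 15 + " " + "a" * 14,   # 30 characters, but one is a space
]

PATTERN = r"^\w{30}$"

def count_matches(lines, flags=0):
    """Count lines consisting of exactly 30 word characters."""
    rx = re.compile(PATTERN, flags)
    return sum(1 for line in lines if rx.search(line))

# With re.ASCII, \w is limited to [a-zA-Z0-9_].
ascii_count = count_matches(lines, re.ASCII)

# Without flags, \w on str patterns also matches non-ASCII word characters.
unicode_count = count_matches(lines)

print(ascii_count, unicode_count)
```

The two counts differ because the accented line only qualifies under the Unicode definition of `\w`. The same pattern gives a similar mismatch between the two GNU grep runs above (3 versus 4), so the timings compare searches that are not quite doing the same work.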
