by nirvanis 1106 days ago
Somewhat related tip: prepend LANG=C to console commands such as grep to speed them up when processing large files, since they will then assume ASCII input (which is probably what you have in most cases).

If you care about speed, you would probably be using ripgrep rather than grep anyway, but doesn’t `LANG=en_US.UTF-8` give similar speed on modern systems, without any compromise on consistency of sort ordering, etc., and support for extended characters?
For GNU grep in particular, no, using a UTF-8 locale can significantly slow it down:

    $ time LC_ALL=C grep -E '^\w{30}$' OpenSubtitles2018.raw.sample.en -c
    3
    
    real    0.808
    user    0.744
    sys     0.063
    maxmem  10 MB
    faults  0
    
    $ time LC_ALL=en_US.UTF-8 grep -E '^\w{30}$' OpenSubtitles2018.raw.sample.en -c
    4
    
    real    20.064
    user    19.982
    sys     0.077
    maxmem  10 MB
    faults  0
Whereas ripgrep is Unicode-aware by default and still about as fast as the ASCII-only variant of GNU grep above:

    $ time rg '^\w{30}$' OpenSubtitles2018.raw.sample.en -c 
    4
    
    real    1.163
    user    1.132
    sys     0.030
    maxmem  916 MB
    faults  0
For grep, how much of the difference is due to '\w' having a different meaning between the two cases?
That's exactly the point. ripgrep uses the Unicode definition of '\w' by default, so it corresponds to what GNU grep does in the en_US.UTF-8 locale.
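To make the '\w' difference concrete, here is a minimal sketch that assumes GNU grep. The helper name `word` and the sample string are made up for illustration and are not from the benchmark above. It feeds grep the UTF-8 bytes of "café" and shows that in the C locale only the ASCII prefix counts as word characters:

```shell
# Hypothetical sample: "café" in UTF-8 (the é is the two bytes 0xC3 0xA9),
# written with octal escapes so the script itself stays pure ASCII.
word() { printf 'caf\303\251\n'; }

# C locale: '\w' covers only ASCII letters, digits and underscore, so the
# line is not made up entirely of word characters and the count is 0.
# grep exits non-zero when nothing matches, hence the "|| true".
word | LC_ALL=C grep -c -E '^\w+$' || true

# The ASCII-only prefix still matches; -o prints just the matched part.
word | LC_ALL=C grep -o -E '^\w+'
```

Running the first command under a UTF-8 locale instead (for example en_US.UTF-8, assuming it is installed; `locale -a` lists what is available) would be expected to count the line, which is the same kind of discrepancy as the 3 vs 4 in the timings above.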
and set it for consistency of ordering (collation) between sort, join, tsort, look, etc.
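A small sketch of the collation point, assuming a POSIX shell and a standard `sort`. The helper name `sample` and its data are made up for illustration. With LC_ALL=C the comparison is on raw byte values, so the result is the same on every machine regardless of which locales are installed:

```shell
# Hypothetical sample data: mixed-case, one item per line.
sample() { printf 'b\nA\na\nB\n'; }

# Byte-order sort: all uppercase ASCII letters come before lowercase ones,
# so this prints A, B, a, b (one per line).
sample | LC_ALL=C sort
```

Tools like join expect their inputs to be sorted in the collation order they themselves use, so setting the same LC_ALL for both the sort step and the consuming tool avoids "input is not in sorted order" surprises.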