
by gravypod 2331 days ago

I'm confused by this as well. A while ago I needed to count the number of lines in a text file with non-uniform line lengths. I was initially using wc but that got too slow so I wrote something that was like wc but only did line counting. I polished it up during a flight to a con and published it on my "blog" [0]. I thought I'd benchmark it to see if these versions beat my hacky and gross C implementation. They don't seem to.

My implementation came very close to what I calculated as my hardware's maximum throughput: within 20x the runtime of a direct file copy.

For my test file I'm using a copy of the Linux source tree in a single file generated with: `(find linux/ -type f -name "*.c" && find linux/ -type f -name "*.h") | xargs cat > linux.txt` which is about 770MB of text.

To get an idea of the maximum possible performance I could hope to achieve:

    cat linux.txt | pv > /dev/null
    769MiB 0:00:00 [3.08GiB/s] [   <=>   ]
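
As a rough cross-check on the `pv` figure, the same baseline can be approximated by reading the file in chunks and discarding the data. This is only a sketch: `read_throughput` and its 1 MiB default chunk size are hypothetical choices, not part of the original test, and a warm page cache will inflate the result just as it does for `cat | pv`.

```python
import time


def read_throughput(path, chunk_size=1 << 20):
    """Read a file in fixed-size chunks and throw the data away.

    Returns (total_bytes, elapsed_seconds). The chunk size is an
    arbitrary assumption; tune it for your own hardware.
    """
    total = 0
    start = time.perf_counter()
    # buffering=0 gives raw reads without Python's extra buffer layer.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes object means end of file
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return total, elapsed
```

Dividing the returned byte count by the elapsed seconds gives a bytes-per-second figure to compare against what `pv` prints.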

Some things I noted when doing my testing is that each program counted the number of lines, characters, and words differently in this corpus. I know for a fact that my program's line count is the only thing I tested at all, and I'm assuming GNU `wc` is well tested. Each program returned consistent results. My guess is there's some control or UTF-8 character in the source that isn't playing nice with everything.

First the Haskell implementation that uses optics, concurrency, and other magic:

    $ nproc 
    32
    $ time ./hs-wc lazy linux.txt
    24961824 77271007 807304327 linux.txt
    ./hs-wc lazy linux.txt  6.88s user 0.89s system 317% cpu 2.446 total
    $ time ./hs-wc simple linux.txt
    24961824 77270977 807299417 linux.txt
    ./hs-wc simple linux.txt  509.04s user 432.26s system 1850% cpu 50.867 total

Then the D implementation from this post:

    $ time ./d-wc linux.txt 
    24961824 77270980 807299417 linux.txt
    ./d-wc linux.txt  22.09s user 0.14s system 99% cpu 22.238 total

The implementation that comes with Ubuntu 19.10:

    $ time wc linux.txt 
    24961824  77270960 807304327 linux.txt
    wc linux.txt  3.59s user 0.08s system 99% cpu 3.672 total

And finally my simple implementation in C:

    $ time ./mine-wc linux.txt 
    24961824 77270966 807304327 linux.txt
    ./mine-wc linux.txt  1.70s user 0.10s system 99% cpu 1.804 total

    $ gcc -O0 wc.c          
    $ time ./a.out linux.txt
    24961824 77270966 807304327 linux.txt
    ./a.out linux.txt  6.51s user 0.12s system 99% cpu 6.636 total

    $ gcc -Wall -Isrc/ -pedantic-errors -Ofast -ftree-vectorize -msse -msse2 -ffast-math wc.c 
    $ time ./a.out linux.txt
    24961824 77270966 807304327 linux.txt
    ./a.out linux.txt  1.37s user 0.09s system 99% cpu 1.467 total

I might be missing something, but from my understanding these are not yet bound by my system's I/O. That might differ for the author's use cases, however, since each disk, system config, etc. is different.

[0] - https://closedjdk.com/post/why-is-wc-so-slow/

1 comment

eMSF 2331 days ago

>I was initially using wc but that got too slow so I wrote something that was like wc but only did line counting.

Any reasonable wc should be fast enough for that purpose; just remember to use the '-l' switch to activate the fast path for line counting.

>Some things I noted when doing my testing is that each program counted the number of lines, characters, and words differently in this corpus.

All programs should report the same line counts (should be exactly the number of line feed characters in the input).
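
To make that definition concrete, here is a minimal sketch of a line counter that does nothing but count line feed bytes. The name `count_lines` and the chunk size are illustrative assumptions; this is not the code of any implementation benchmarked above.

```python
def count_lines(path, chunk_size=1 << 20):
    """Count line feed (0x0A) bytes in a file, reading it in chunks.

    A final line without a trailing newline is not counted, which
    matches the "number of line feed characters" definition.
    """
    lines = 0
    with open(path, "rb") as f:
        # Read until f.read() returns an empty bytes object (EOF).
        while chunk := f.read(chunk_size):
            lines += chunk.count(b"\n")
    return lines
```

Because only one byte value matters, the result does not depend on the locale or on how the rest of the input is encoded.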

Other than that, it really depends on your current locale and the wc you're using (see https://github.com/expr-fi/fastlwc README for more details). Do note that this D implementation seems to implement the wc character counting behaviour you get with the '-m' switch (instead of the byte counting default).
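
The byte versus character distinction can be illustrated with a short sketch. It assumes the input is valid UTF-8 and counts a character for every byte that is not a continuation byte; `count_bytes_and_chars` is a hypothetical helper, and a real wc in a given locale may treat invalid sequences differently.

```python
def count_bytes_and_chars(data: bytes):
    """Return (byte_count, char_count) for UTF-8 encoded input.

    Assumption: the input is valid UTF-8. Continuation bytes have the
    bit pattern 10xxxxxx, so every other byte starts a new character.
    """
    n_bytes = len(data)
    n_chars = sum(1 for b in data if (b & 0xC0) != 0x80)
    return n_bytes, n_chars
```

For pure ASCII input the two numbers are equal; they diverge as soon as the file contains multi-byte characters, so a tool that counts characters reports a smaller total than one that counts bytes.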

Also worth noting that different operating systems have different locale definitions; glibc locales explicitly treat non-breaking spaces as non-whitespace characters, while for example Windows doesn't.
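
A small sketch of how the whitespace definition alone changes the word count. It uses Python's own splitting rules as a stand-in for two different locale definitions, which is an assumption for illustration only: it does not reproduce glibc's or Windows' actual classification tables, and both function names are hypothetical.

```python
def count_words_ascii(data: bytes) -> int:
    """Split on ASCII whitespace only (space, tab, newline, and so on)."""
    # bytes.split() with no argument treats only ASCII whitespace as a
    # separator, so the UTF-8 bytes of U+00A0 stay inside a word.
    return len(data.split())


def count_words_unicode(data: bytes) -> int:
    """Split on Unicode whitespace, which includes U+00A0 (no-break space)."""
    # str.split() with no argument uses the Unicode whitespace definition.
    text = data.decode("utf-8", errors="replace")
    return len(text.split())
```

For input containing `foo`, a non-breaking space, then `bar baz`, the first function sees two words and the second sees three, so two otherwise correct tools can disagree on the same file.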