Hacker News new | ask | show | jobs
by greggyb 2329 days ago
Perhaps I'm just being pedantic or maybe I am misunderstanding.

The author claims to be IO bound toward the end. But they are comparing to two versions that are faster.

It is my understanding that IO-bound means that the IO subsystem is the thing which limits run time of the program. But the author clearly demonstrates that the IO subsystem of their machine is capable of supporting faster wc binaries.

So what am I missing here?

6 comments

You are pretty much right. It isn't IO bound.

If your definition said IO must be the only thing limiting the time, then few programs would be IO bound, except trivial ones like "count arriving packets". In a packet counter, your "implementation" would not affect wall clock time at all until packets could arrive faster than 3GHz, or if you figured out a way to make `count++` run slower than packets could arrive.

Usually IO bound means 'kinda like that packet counter'. There are problems with being exactly like that packet counter (e.g. are you using the IO subsystem inefficiently, like reading one char at a time?), but it has the property that speeding up IO speeds up the program, and speeding up your code doesn't speed up the program (much, or at all).

You're right that it isn't a useful comparison between existing programs. It is useful to compare your program's CPU performance to theoretical limits on wall time. When your `wc` implementation approaches the speed of reading a file and doing nothing with it, then you can say it's IO bound. For this reason, there are very few single-file-reading programs that could be described as IO bound. It's common in networking where networks go much slower, and in filesystem traversal (e.g. ripgrep) but not for plain file reading.

Thanks for the thorough response. This aligns with what I had in my head, but it's nice to see a clearer explication and also confirmation that I'm not way out in left field with how I understand things.
People incorrectly use the term “IO bound” or “bottlenecked by IO” without thinking about it all the time.

Like they will talk about how their web app is IO bound because the DB query takes 1 second while their slow ruby code only takes 300ms after it gets the result from the DB back.

Well guess what, making the web app twice as fast still cuts 150ms off the response time, and it still means you can do twice as many requests on the same server.

In order to be able to say that something is “bound” by something else, you have to have some kind of concurrency going on. One task has to be doing all it’s work in the time that it’s waiting for more work to arrive from another.

> So what am I missing here?

The Haskell program is multi-threaded (I don't know about wc's official implementation, but I assume it is as well), while the D program is single-threaded.

GNU coreutils' wc (which is the reference) does not seem to be using threads, no.

At least not at all obviously. No obvious header included and no mention of threads. I only checked the code [1] very quickly, though.

[1] https://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=blob;...

I saw that. If adding threads makes a program faster, that implies that more CPU can make it go faster. If the filesystem is already streaming bytes to the program as fast as it can, adding threads just means that you have more threads waiting for bytes.

If adding threads increases the speed, this implies that the program is both CPU-bound and that it has opportunities for parallelism.

So, that sounds like it's CPU bound then...
Maybe the author considers "IO bound" to include "Uses IO inefficiently". If a program uses unbuffered IO one character at a time, it does spend almost all its runtime waiting on IO syscalls.
I think a good test for IO bound is: will it run faster if disk is faster? If yes, then it's IO bound, if not, it means the bottleneck is somewhere else.
I'm confused by this as well. A while ago I needed to count the number of lines in an ununiformly long text file. I was initially using wc but that got too slow so I wrote something that was like wc but only did line counting. I polished it up during a flight to a con and published it on my "blog" [0]. I thought I'd benchmark it to see if these versions beat hacky and gross my C implementation. They don't seem to do so.

My implementation was very close to what I calculated for my hardware's max throughput within 20x the runtime of doing a direct file copy.

For my test file I'm using a copy of the linux source tree in a singe file generated with: `(find linux/ -type f -name ".c" && find linux/ -type f -name ".h") | xargs cat > linux.txt` which is about 770MB of text.

To get an idea of the maximum possible performance I could hope to achieve:

    cat linux.txt | pv > /dev/null
    769MiB 0:00:00 [3.08GiB/s] [   <=>                                                                                                                                                                     
 ]
Some things I noted when doing my testing is that each program counted the number of lines, characters, and words differently in this corpus. I know for a fact that my program's line count is the only thing I tested at all and I'm assuming `wc` from gnu is well tested. Each program returned consistent results. My guess is there's some control/utf8 character or something in the source that isn't playing nice with everything.

First the haskell implementation that uses optics, concurrency, and other magic:

    $ nproc 
    32
    $ time ./hs-wc lazy linux.txt
    24961824 77271007 807304327 linux.txt
    ./hs-wc lazy linux.txt  6.88s user 0.89s system 317% cpu 2.446 total
    $ time ./hs-wc simple linux.txt
    24961824 77270977 807299417 linux.txt
    ./hs-wc simple linux.txt  509.04s user 432.26s system 1850% cpu 50.867 total

Then the D implementation from this post:

    $ time ./d-wc linux.txt 
    24961824 77270980 807299417 linux.txt
    ./d-wc linux.txt  22.09s user 0.14s system 99% cpu 22.238 total
The implementation that comes with ubuntu 19.10:

    $ time wc linux.txt 
    24961824  77270960 807304327 linux.txt
    wc linux.txt  3.59s user 0.08s system 99% cpu 3.672 total
And finally my simple implementation in C:

    $ time ./mine-wc linux.txt 
    24961824 77270966 807304327 linux.txt
    ./mine-wc linux.txt  1.70s user 0.10s system 99% cpu 1.804 total

    $ gcc -O0 wc.c          
    $ time ./a.out linux.txt
    24961824 77270966 807304327 linux.txt
    ./a.out linux.txt  6.51s user 0.12s system 99% cpu 6.636 total

    $ gcc -Wall -Isrc/ -pedantic-errors -Ofast -ftree-vectorize -msse -msse2 -ffast-math wc.c 
    $ time ./a.out linux.txt
    24961824 77270966 807304327 linux.txt
    ./a.out linux.txt  1.37s user 0.09s system 99% cpu 1.467 total

I might be missing something but from my understanding these are not yet bound by my system IO. It might be in the authors use cases however since each disk, system config, etc is different.

[0] - https://closedjdk.com/post/why-is-wc-so-slow/

>I was initially using wc but that got too slow so I wrote something that was like wc but only did line counting.

Any reasonable wc should be fast enough for that purpose; just remember to use the '-l' switch to activate the fast path for line counting.

>Some things I noted when doing my testing is that each program counted the number of lines, characters, and words differently in this corpus.

All programs should report the same line counts (should be exactly the number of line feed characters in the input).

Other than that, it really depends on your current locale and the wc you're using (see https://github.com/expr-fi/fastlwc README for more details). Do note that this D implementation seems to implement the wc character counting behaviour you get with the '-m' switch (instead of the byte counting default).

Also worth noting that different operating systems have different locale definitions; glibc locales explicitly treat non-breaking spaces as non-whitespace characters, while for example Windows doesn't.