Hacker News new | ask | show | jobs
Wc in D: 712 Characters Without a Single Branch (dlang.org)
132 points by aldacron 2329 days ago
12 comments

Perhaps I'm just being pedantic or maybe I am misunderstanding.

The author claims to be IO bound toward the end. But they are comparing to two versions that are faster.

It is my understanding that IO-bound means that the IO subsystem is the thing which limits run time of the program. But the author clearly demonstrates that the IO subsystem of their machine is capable of supporting faster wc binaries.

So what am I missing here?

You are pretty much right. It isn't IO bound.

If your definition said IO must be the only thing limiting the time, then few programs would be IO bound, except trivial ones like "count arriving packets". In a packet counter, your "implementation" would not affect wall clock time at all until packets could arrive faster than 3GHz, or if you figured out a way to make `count++` run slower than packets could arrive.

Usually IO bound means 'kinda like that packet counter'. There are problems with being exactly like that packet counter (e.g. are you using the IO subsystem inefficiently, like reading one char at a time?), but it has the property that speeding up IO speeds up the program, and speeding up your code doesn't speed up the program (much, or at all).

You're right that it isn't a useful comparison between existing programs. It is useful to compare your program's CPU performance to theoretical limits on wall time. When your `wc` implementation approaches the speed of reading a file and doing nothing with it, then you can say it's IO bound. For this reason, there are very few single-file-reading programs that could be described as IO bound. It's common in networking where networks go much slower, and in filesystem traversal (e.g. ripgrep) but not for plain file reading.

Thanks for the thorough response. This aligns with what I had in my head, but it's nice to see a clearer explication and also confirmation that I'm not way out in left field with how I understand things.
People incorrectly use the term “IO bound” or “bottlenecked by IO” without thinking about it all the time.

Like they will talk about how their web app is IO bound because the DB query takes 1 second while their slow ruby code only takes 300ms after it gets the result from the DB back.

Well guess what, making the web app twice as fast still cuts 150ms off the response time, and it still means you can do twice as many requests on the same server.

In order to be able to say that something is “bound” by something else, you have to have some kind of concurrency going on. One task has to be doing all it’s work in the time that it’s waiting for more work to arrive from another.

> So what am I missing here?

The Haskell program is multi-threaded (I don't know about wc's official implementation, but I assume it is as well), while the D program is single-threaded.

GNU coreutils' wc (which is the reference) does not seem to be using threads, no.

At least not at all obviously. No obvious header included and no mention of threads. I only checked the code [1] very quickly, though.

[1] https://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=blob;...

I saw that. If adding threads makes a program faster, that implies that more CPU can make it go faster. If the filesystem is already streaming bytes to the program as fast as it can, adding threads just means that you have more threads waiting for bytes.

If adding threads increases the speed, this implies that the program is both CPU-bound and that it has opportunities for parallelism.

So, that sounds like it's CPU bound then...
Maybe the author considers "IO bound" to include "Uses IO inefficiently". If a program uses unbuffered IO one character at a time, it does spend almost all its runtime waiting on IO syscalls.
I think a good test for IO bound is: will it run faster if disk is faster? If yes, then it's IO bound, if not, it means the bottleneck is somewhere else.
I'm confused by this as well. A while ago I needed to count the number of lines in an ununiformly long text file. I was initially using wc but that got too slow so I wrote something that was like wc but only did line counting. I polished it up during a flight to a con and published it on my "blog" [0]. I thought I'd benchmark it to see if these versions beat hacky and gross my C implementation. They don't seem to do so.

My implementation was very close to what I calculated for my hardware's max throughput within 20x the runtime of doing a direct file copy.

For my test file I'm using a copy of the linux source tree in a singe file generated with: `(find linux/ -type f -name ".c" && find linux/ -type f -name ".h") | xargs cat > linux.txt` which is about 770MB of text.

To get an idea of the maximum possible performance I could hope to achieve:

    cat linux.txt | pv > /dev/null
    769MiB 0:00:00 [3.08GiB/s] [   <=>                                                                                                                                                                     
 ]
Some things I noted when doing my testing is that each program counted the number of lines, characters, and words differently in this corpus. I know for a fact that my program's line count is the only thing I tested at all and I'm assuming `wc` from gnu is well tested. Each program returned consistent results. My guess is there's some control/utf8 character or something in the source that isn't playing nice with everything.

First the haskell implementation that uses optics, concurrency, and other magic:

    $ nproc 
    32
    $ time ./hs-wc lazy linux.txt
    24961824 77271007 807304327 linux.txt
    ./hs-wc lazy linux.txt  6.88s user 0.89s system 317% cpu 2.446 total
    $ time ./hs-wc simple linux.txt
    24961824 77270977 807299417 linux.txt
    ./hs-wc simple linux.txt  509.04s user 432.26s system 1850% cpu 50.867 total

Then the D implementation from this post:

    $ time ./d-wc linux.txt 
    24961824 77270980 807299417 linux.txt
    ./d-wc linux.txt  22.09s user 0.14s system 99% cpu 22.238 total
The implementation that comes with ubuntu 19.10:

    $ time wc linux.txt 
    24961824  77270960 807304327 linux.txt
    wc linux.txt  3.59s user 0.08s system 99% cpu 3.672 total
And finally my simple implementation in C:

    $ time ./mine-wc linux.txt 
    24961824 77270966 807304327 linux.txt
    ./mine-wc linux.txt  1.70s user 0.10s system 99% cpu 1.804 total

    $ gcc -O0 wc.c          
    $ time ./a.out linux.txt
    24961824 77270966 807304327 linux.txt
    ./a.out linux.txt  6.51s user 0.12s system 99% cpu 6.636 total

    $ gcc -Wall -Isrc/ -pedantic-errors -Ofast -ftree-vectorize -msse -msse2 -ffast-math wc.c 
    $ time ./a.out linux.txt
    24961824 77270966 807304327 linux.txt
    ./a.out linux.txt  1.37s user 0.09s system 99% cpu 1.467 total

I might be missing something but from my understanding these are not yet bound by my system IO. It might be in the authors use cases however since each disk, system config, etc is different.

[0] - https://closedjdk.com/post/why-is-wc-so-slow/

>I was initially using wc but that got too slow so I wrote something that was like wc but only did line counting.

Any reasonable wc should be fast enough for that purpose; just remember to use the '-l' switch to activate the fast path for line counting.

>Some things I noted when doing my testing is that each program counted the number of lines, characters, and words differently in this corpus.

All programs should report the same line counts (should be exactly the number of line feed characters in the input).

Other than that, it really depends on your current locale and the wc you're using (see https://github.com/expr-fi/fastlwc README for more details). Do note that this D implementation seems to implement the wc character counting behaviour you get with the '-m' switch (instead of the byte counting default).

Also worth noting that different operating systems have different locale definitions; glibc locales explicitly treat non-breaking spaces as non-whitespace characters, while for example Windows doesn't.

Gave myself a quick D lesson just to understand the approach to flags here (Yes.keepTerminator rather than just a meaningless bool). Turns out D lets you define a template with any args you like, in this case taking a string for a name of a Flag type, which in turn contains an enum with 'yes' and 'no' boolean values. This means that you can only use the right type of Flag, with a matching name, and its yes/no value. But the syntax is a bit icky (Flag!"keepTerminator) and so _another_ nice feature of D appears to be that you can intercept field dispatch in a struct. And so the 'Yes' struct does this, captures the 'keepTerminator' as a string, and creates the correct type of flag.

For whatever reason I found all this rather cute (and apologies to any actual D programmers if I've misread this whole situation).

Spot on! :) With the help of opDispatch (the catch-all member function temlate), it's possible to drop the string from the use site. (I don't know a way of dropping it from the type name.)

  import std.stdio;
  import std.typecons;
  import std.string;
  
  // This type's opDispatch removes the need for string in the flag name.
  struct FlagFromBool {
    auto opDispatch(string flagName)(bool value) {
      mixin (format!q{
        return value ? Yes.%s : No.%s;
      }(flagName, flagName));
    }
  }
  
  // A convenience function to remove the need for empty struct construction parenthesis.
  auto flagFromBool() {
    return FlagFromBool();
  }
  
  // Unfortunately, the type name still requires string flag names:
  void bar(Flag!"foo" flag) {
    writeln("called with ", flag);
  }
  
  void main() {
    // However, the expressions don't need a string:
    bar(flagFromBool.foo(false));
    bar(flagFromBool.foo(true));
  }
Both your convenience function and the quotes in the type name can be removed via the use of static opDispatch:

struct FlagImpl(string name) { bool value; alias value this; }

struct Flag { alias opDispatch(string name) = FlagImpl!name; }

struct Yes { static auto opDispatch(string name)() { return FlagImpl!name(true); } }

struct No { static auto opDispatch(string name)() { return FlagImpl!name(false); } }

void fun(Flag.foo a) {} // Look ma, no quotes!

unittest { fun(Yes.foo); fun(Flag.foo(true)); }

D is getting named function arguments, which will make the "meaningless bool" a bit more meaningful.
How will that work with c compatibility?
That's a very good question! When connecting with C code, the C functions being called will need to conform to O/B principles as far as the interface to the function is concerned. It's the same issue as calling unsafe code - the compiler can't check it, the programmer has to.

But if you use D as a BetterC compiler, then the D compiler can check the O/B rules in the code.

Functions that are O/B checked are marked with the `@live` attribute, meaning that you'll be able to incrementally use O/B in the code. This is much like how you can use the `pure` attribute to incrementally write functionally pure code.

What's O/B?

Google took me right back to your comment.

Walter got a bit confused here - what he's describing is the owner/borrow system he's proposed to compete with Rust. This is unrelated to named parameters.
"without a single branch" I thought it meant actually a branchless version of wc. Turns out it's just no explicit if statements.
If you were careful, it seems pretty plausible to be able to use some indexing tricks and CMOV to write a jump/branchless version of wc. You are basically counting newlines and runs of whitespace.
Yeah, it's Nnt just plausible, it's fairly straightforward. Here are some untested versions that have branch-free inner loops for lines (easiest, this will probably compile branch-free even if you aren't careful), and words:

https://godbolt.org/z/JNU-_G

CMOV aside, I remember it was proved MOV itself is turing-complete.
slightly related, does embedding arithmetic in mov instructions go faster than explicit ALU operations ?
Depending on the arithmetic, it seems that yes it can! I've noticed gcc using LEA instructions for arithmetic of the form (x * a + b), where 'a' and 'b' fit with the instruction.
Using LEA (load effective address) for calculation seems to be pretty common in most programs for both gcc and llvm. You basically get smaller code, and it used to schedule them better across more execution ports. Not sure if the CPU can fuse the ADD+MUL now.
"Counting newlines" implies branching.
You could subtract the value of '\n' from each byte, bitwise-or the result with its negated value, right shift the sign bit into the least significant position, and push that into an accumulator.

Basically, it is possible to reduce a byte into a one or a zero if it does or does not match a value using only bitwise instructions. I don't know if the way I just described uses the fewest instructions, but you can definitely do it without branching.

In fact, you can do this to test any integer comparison. Start by subtracting the arguments to the comparison operator, and then use one of these "squash" functions for the comparison operator in question (assuming 32 bit signed integers are being compared):

== : ~(c | -c) >>> 31

!= : (c | -c) >>> 31

> : -c >>> 31

>= : ~c >>> 31

< : c >>> 31

<= : ~-c >>> 31

Not really:

   cnt += *ptr++ == '\n'
Should compile down to no branches (not even cmov) by summing the value of the flag register directly. Would be a it hard to stop without taking a branch though. Would you consider function pointer calls a branch? If that's too easy, what about taking a segfault?
Even if you do the branchiest thing possible in the source, no bitwise or "looks cmov-y" stuff, all compilers that matter will do this without a branch with optimization on:

https://godbolt.org/z/XCBGYW

Only icc vectorizes this at -O2 (still branchless, of course), but clang and gcc vectorize it if you go to -O3.

Sure, but where's the code golfing fun in there ;)
Almost everyone in tech, it seems, interprets “without X” as meaning “without explicit X :(
I believe I'm missing something here or my day was too long, but in Clojure this can be squeezed in 13 lines and 435 characters keeping things fairly readable (for Clojure & Lisp developers ;)).

    (defn wc [^String file]
      (with-open [rdr (clojure.java.io/reader file)]
        (apply (partial printf "%d %d %d\n")
               (reduce
                (fn [[nl nw nb] ^String ln]
                  (let [words (count (.split ln "[ ]+"))
                        bytes (alength (.getBytes ln "UTF-8"))]
                    [(inc nl) (+ nw words) (+ nb bytes)]))
                [0 0 0]
                (line-seq rdr)))))
    
    (defn -main [& args]
      (wc (first args)))

I haven't tested how fast it is, but startup time can be optimized by compiling it with GraalVM.
Neat! I wonder if the character decoding & regex usage has noticeable performance impact. My Common Lisp version was sped up somewhat by switching from a character stream to a byte stream: https://git.tazj.in/tree/fun/wcl/wc.lisp

You can try this one via Nix with:

  nix-build -E '(import (builtins.fetchGit "https://git.tazj.in") {}).fun.wcl'
It's been a while since I last wrote CL but I think your program can produce any counts between zero and the correct one – you need to use eql instead of eq if you want this to work in standard common lisp.
Ah, you're right of course - fixed. Though that still won't make this portable, as I'm using an SBCL-specific way of accessing argv.
`unix-opts` is in available in quicklisp -- it's small, and has a portable `argv` wrapper function.
That's a very impressive Nix configuration. Nicely done!
Since we're golfing, here's a 51-character version in fairly readable J:

    wc=:[:+/[:(1:,([:#'\S+'&rxmatches),>:@#);._2 freads
I'm guessing that a real J wizard could squeeze out a few more characters. :)
There's also a wc in rust (of course) with more code (120 lines) but quite more efficient: https://medium.com/@martinmroz/beating-c-with-120-lines-of-r...
In the blog post you linked, a library for parallelism is used.
It's an additional two lines of code when they add it (and probably one more to pull it in above), and only happens at the end of the actual work, after they've matched C's performance and beat C's memory footprint. Using the parallelism library at the very end hardly invalidates the rest of the exercise.
A couple of points with comparing coreutils,

* recompiling with -march=native can give significant wins over more generic binaries provided by linux distros.

* parallel processing helps with bigger files, and it's easy enough to leverage the existing wc binary to process in parallel.

Both points are discussed at: https://www.pixelbeat.org/docs/unix-parallel-tools.html

I'm wavering between D and Rust for which one could be used as a better alternative to C++. (Though I don't see C++ getting replaced anytime soon.) Even if Rust is getting most of the attention, D also seems to be a strong candidate.
I've been using Rust for a year or so and while it's very nice once you get used to it I recently started working in D since I feel _so_ much more productive in it thanks to the GC while it still feels powerful and is fast where it matters (you can avoid the GC at performance critical places). The lack of libraries is a little annoying at times but thanks to dstep it's somewhat easy to use C stuff.
D is also getting an Ownership/Borrowing system:

https://github.com/dlang/dmd/pull/10747

D is fun! It's so easy to get started. Rust has an initial learning curve that just doesn't doesn't seem to fun to me, but I guess I'll have to actually scale it some day.
I'd say the main determinant is what kind of C++ are you writing and what features of C++ are most important for you. If template metaprogramming is your kind of thing, then you absolutely need to try D, D makes heavy use of template metaprogramming (yes, the lengthy errors you know from C++ make a comeback in D too). If your C++ is more like C with classes, you'll feel right at home in D too.

However, if the most important features of C++ is RAII, shared pointers, unique pointers, move semantics, then you'd probably enjoy Rust more, because it focuses on this kind of memory/resource management techniques.

Rust is definitely more difficult but it has it benefits, I guess? For me as a Python guy D was just easier both concept wise and syntax wise. I could write a relatively complex algorithm after 2 weeks of reading a D programming book with just standard ops. And it was definitely faster. Maybe not as fast as C but I felt efficient. Personally, I liked that D does pray FP like Scala while also being a multiparadigm language. Aaand it is definitely more readable.
*does not pray FP
Fairly elegant code. Using a library that splits the line into words makes it pretty simple.
This is ridiculous: of course there are branches, but you don’t explicitly write them. This is purely aesthetic.

Edit: i simply wish the author illustrated why this is good or desirable—conditionals are not difficult to read.

The aesthetic aspect is insignificant, it's about the semantics -- and thus reasoning about the code and other such properties. So nothing ridiculous about it.

It's like Haskell code "of course has" anything C has, as underneath the both run assembly instructions full of gotos and state manipulation, you just "don't explicitly write it".

Do you have difficulty reasoning through conditionals?

Meanwhile the semantics of objects, especially one named “Output”, are super obvious and easy to reason about.

Branches aren’t hard to reason about though. The only* reason anyone cares about branches is that branch misprediction is expensive.
I think branches are actually harder to reason about, we're just used to doing it since all of us have done it for so long. However I do believe that branches are less straightforward than "linear" code and add to mental complexity on larger projects.

Of course there is no data to back this up, but I think one of the next trends in programming beyond the adoption of functional styles, immutable data, "functional core / imperative shell" will be abstracting away from explicit conditional/branching logic in higher level code.

Obviously people can come up with pedantic/extreme cases where the abstraction does nothing to hide the complexity, or even makes things more complex but im not taking about that. I mean more simple abstractions like what was used in OP, or a filter() abstracting over a while and if combo.

I'm convinced based on personal experience that it makes for cleaner code and will become more widely adopted over the years as people explore it.

How is it possible to reason about filter() without thinking it through as effectively a conditional? How is that easier than being explicit about it?
Why is separating pure functional code from imperative / side effect causing code easier to reason about? You still have to have the side-effect somewhere.

Why are immutable data structures easier to reason about? They still produce the same results.

None of the above are proven facts, but they are generally agreed upon principles that a number of people have observed. And I happen to believe them as well.

Overall my opinion of why it's easier is that like most optimizations it reduces the number of states/cases you have to think about in the "common path". With a solid abstraction you rarely need to think about the internals, with explicit code you need to check and think through all of the edge cases.

filter() (or map, or ...) is probably the best simple case I can think about. There is more room for bugs in an explicit (ex. for (int = 0; i < ...) {}) than in a filter over a collection / enumerable. When reading the code you need to think through base and termination conditions as well as confirm aggregation is happening correctly. With a filter() you don't have to do it / be explicit about it. Obviously you still need to know what filter means (which means you know it's implementation) but you don't need to focus on the details, only the filter condition: the signal amid the noise. The rest is pushed behind a solid reusable abstraction. Again thinking about it in isolation if someone just introduced the filter function to you you might look at it an say it really doesn't add much and if anything obscures my code I don't see the value. But after adopting it, and it's related family of monadic collection operations, code can now be written at a much denser, fewer off-by-one errors, higher signal to noise, level of abstraction. It's provided structure and common abstractions to unstructured iteration. To the point you begin to think in these abstractions (which I think is a benefit).

Another example are parser combinators. There is nothing that parser combinators do that can't be done by hand. But one of the primary way they simplify coding is by hiding away conditional (and looping logic). "?*.." , | and & concepts. Parser combinators also benefit in simplifying things by building an algebra for users to work in with nice closed operations. But again that only became possible by abstracting away the conditional and looping logic and making these nice simple abstractions that can be easily combined.

So my summary is, my view is that abstracting over control flow code does a lot to simplify logic. A lot of logic involving looping has already been abstracted over and included in languages and libraries (in general "functional programming" styles). Now that the low hanging fruit has been incorporated I think next will be the next layer of non-looping conditional logic.

Why do you consider a conditional (I have no clue why you call it a “branch”, a semantically meaningless term at the language level) hard to reason about?
I find D easier to grok coming from a Java/Python background.
I've only toyed with D a bit. IMHO, if you come from a typical OO programming language background (which to me includes C++, Java, Python...), the majority of D will immediately feel familiar, and to a great extent, obvious.

The syntax is familiar and the ideas are familiar. You don't have to learn anything new (not immediately anyway) like you do in Rust (where you basically need to learn EVERYTHING new).

Where this idyllic scenario starts falling apart with when you start actually using it for anything half-serious. Some of the bits feel extremely unintuitive and the documentation is difficult to navigate. There are few examples and the tutorial is a bit spartan. For example, I needed a deque-like container (double-ended queue), but it took me ages to figure out that a) the language actually has one and b) how to use the bloody thing.

There is also a bit of schizophrenia going on, with the "new" ideas and the "old" ideas clashing in some places. For example, they claim that you can run D without a GC (the new), but apparently a good chunk of the stdlib requires the GC (the old), so you're stuck.

I find this all to be unfortunate because D, to me, feels like it could be a better, saner C++.

> There is also a bit of schizophrenia going on, with the "new" ideas and the "old" ideas clashing in some places. For example, they claim that you can run D without a GC (the new), but apparently a good chunk of the stdlib requires the GC (the old), so you're stuck.

AFAIK this is somewhat intentional; they don't want to make any hard compatibility breaks, so there's a long deprecation period for any 'old' idea. There's also a lack of manpower to renovate libraries; e.g. there's no good xml library.

Regarding GC, it's IMO not a huge problem. The GC is really not a problem for most applications, and for those where it is, you can simply avoid GC allocations in inner loops (GC only runs when you allocate from it).

> they claim that you can run D without a GC (the new), but apparently a good chunk of the stdlib requires the GC (the old), so you're stuck.

The intent isn't to turn off the GC completely (though GC-averse folks assume that it is). The `@nogc` function attribute is intended to be applied where you need it. Then you can guarantee that in that function's call stack, no language features that require the GC will be used.

The standard library has been retrofitted to eliminate use of the GC where it isn't needed and provide alternatives where possible (such as a function that takes a buffer as an argument alongside one that allocates). There may still be places where it can be trimmed down even more, but it will never be fully `@nogc` compatible.

D is meant to be used with the GC, but provides the means to avoid allocations, turn collections on/off (`GC.disable/enable`) and command line options for profiling GC usage and affecting its behavior. Anyone who wants to turn off the GC completely is going beyond the primary intended use case and is of course going to run into bumps with the standard library. Much of it is still usable, though.

See https://dlang.org/blog/the-gc-series/

I feel the same way too. It's fairly easy to translate Python into D, and you get all of that compile-time goodness to go with it (static type checking and compile-time function execution). And it's pretty fast! I bet we could optimise this D wc to match GNU's wc without too many crazy tricks.
D is an amazing language. But what eventually put me off is the lack of documentation for installing and linking libraries in the compilers.
This is a bit perplexing. Libraries don't need to be installed, you simply put them in a directory, add a path to it in the command to the compiler, and list the library on the command to the compiler.

Just as you would for C and C++.

Other languages may need libraries to be installed, but not D.

Just like in Java
My only issue is that it's far more complicated than Java and for a language that fills that same niche it's hard for me to justify putting my resources into learning it even though I do like a lot of the features.
It's certainly possible to write more complicated code in a language like D than in Java. I personally find the verbosity of Java combined with jamming everything into a single approach makes Java hard to read.
Java is indeed a simpler language, much simpler. Unfortunately, my experience with Java is it is too simple - I had to write way too much boilerplate over and over.
D is designed to be able to do 100% of everything C++ can do. "Alias this" is intended to replace C++'s implicit conversion through multiple inheritance pattern and its "function that returns a value with implicit conversion" pattern. "Mixin templates" are a way to do CRTP.
In what way is more complicated than Java?
Java is actually a simple language.
well, java isn't really complicated. it's just verbose. but my understanding is that d has many of the features of c and modern cpp. that alone makes it more complicated than java...
You can also write verbose code in D. The language doesn't prohibit it.
I'm no D expert but this seems to assume one file to count given as one argument on the command line. The wc in coreutils takes any number of arguments or will happily count STDIN.
Every time I see a post with some D source, I think the language looks great; but for some reason I never try to learn it...
That's ok, one of these days you'll try it and then wonder what took you so long :-)
Start playing with no setup:

https://run.dlang.io/