Hacker News new | ask | show | jobs
by notacoward 2322 days ago
This doesn't seem to be comparing anything like the same thing. Does the Haskell version really do the same thing as the C version? Does it handle all of the same error cases, providing the same quality of error messages if they occur? Does it handle localization? If not, that makes the comparison very skewed, as unhammer already pointed out. Sure, if you strip out all of the things that the people who wrote wc actually spent all that time on then you can go faster, but failing to note the differences is simply dishonest.
8 comments

Sup, author here.

> Does the Haskell version really do the same thing as the C version?

It counts bytes, words and lines and, modulo intended Unicode space handling and unintended bugs, does the same thing.

Indeed, it does not count things like max line length or char count, but those can be plugged in without significant performance overhead (and that's what the second part is gonna be about).

> Does it handle all of the same error cases, providing the same quality of error messages if they occur?

There are no error messages at this point. Although I don't really see how this should affect performance.

> Does it handle localization?

If you mean counting multi-byte characters, then not yet. Although I'm pretty convinced it does not require doing something much more complicated — but maybe I'm wrong, we'll see in the next part.

Also, thanks for the feedback, those are important questions! Something to keep in mind when writing subsequent posts.

Reproducing what your code does in the most simple, naive C program possible, I can beat the existing wc utility, taking only around 40% of the time that wc takes.

So until you put in locale handling, alternate line endings, option handling, and error handling, I don't see that your post is at all convincing.

Quite the opposite.

So I look forward to a Haskell version that supports everything the wc has so we can get a fair comparison.

> So until you put in locale handling, alternate line endings, option handling, and error handling, I don't see that your post is at all convincing.

That's precisely what the second part would be about.

And, if I succeed, IMO, that's where Haskell would really shine (because composability and local reasoning), and where I would be able to claim to achieve something — the stuff in the post we're discussing is indeed trivial (I didn't want to say that in the post itself though as I think it'll look like I'm belittling the guy who did the original post), while modularizing this is _fun_.

FWIW, my version, now compiled with -O3, is 8 times faster than wc. And I haven't even tried to optimise it.

I look forward to your results.

Here's the source code for GNU wc - https://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=blob;... .

It's easy to see that it's not the result of 10+ years of low-level optimizations to eek out the most performance.

Your test code probably hits the MB_CUR_MAX>1 path at line 361. (Check your locale setting!)

The main loop is:

   402   if (!in_shift && is_basic (*p))
   403     {
   404       /* Handle most ASCII characters quickly, without calling
   405          mbrtowc().  */
   406       n = 1;
   407       wide_char = *p;
   408       wide = false;
   409     }
    ...
   443   switch (wide_char)
   444     {
   445     case '\n':
   446       lines++;
   447       FALLTHROUGH;
   448     case '\r':
   449     case '\f':
   450       if (linepos > linelength)
   451         linelength = linepos;
   452       linepos = 0;
   453       goto mb_word_separator;
   454     case '\t':
   455       linepos += 8 - (linepos % 8);
   456       goto mb_word_separator;
   457     case ' ':
   458       linepos++;
   459       FALLTHROUGH;
   460     case '\v':
   461     mb_word_separator:
   462       words += in_word;
   463       in_word = false;
   464       break;
   465     default:
   466       if (wide && iswprint (wide_char))
                  ....
   480       else if (!wide && isprint (to_uchar (*p)))
   481         {
   482           linepos++;
   483           if (isspace (to_uchar (*p)))
   484             goto mb_word_separator;
   485           in_word = true;
   486         }
   487       break;
   488     }
This is a much more complicated implementation than your code. Among other things, note how it uses isprint/iswprint on each character, and how these are locale dependent.

Even in when character = byte, the main loop uses the same logic:

   555     default:
   556       if (isprint (to_uchar (p[-1])))
   557         {
   558           linepos++;
   559           if (isspace (to_uchar (p[-1]))
   560               || isnbspace (to_uchar (p[-1])))
   561             goto word_separator;
   562           in_word = true;
   563         }
   564       break;
   565     }
Your benchmark only uses the characters:

    "\n\r ',-./0123456789ABCDEFGHIJKLMNOPQRSTUVYZabcdefghijklmnopqrstuvwxyz"
which means it comes nowhere near being a good test which verifies the two programs do the same thing.

The following should be a more difficult test set to reproduce wc's output. I create the test set with Python:

    with open("testset.dat", "wb") as f:
      for b1 in range(256):
        for b2 in range(256):
          for b3 in range(256):
            _ = f.write(bytes((b1, b2, b3)))
then print the output using two different locales:

    % env LANG=en_US.UTF-8 wc testset.dat
      196608 1523713 50331648 testset.dat
    % env LANG=C wc testset.dat
      196608 1152001 50331648 testset.dat
This is a great point about handling printable vs non-printable characters that I originally missed when I read wc code. Thank you for pointing this out!
Could you elaborate on how you read the wc code?

I ask because the essay and your comments until now show no insight from reading the code.

I find it difficult to understand how anyone could miss that (rather significant) part of the core algorithm, and then assert the differences are due to only "modulo intended Unicode space handling" and the like.

Until now I had assumed you had a lay understanding of wc, and had not read the code.

Good point. I was thinking the same thing too.

I will wait for what the next article presents, as the end of this article states:

> Stay tuned for a second part, where we will investigate shaping all this into an actual wc substitute, where different statistics can be turned on or off, all while not computing what the user hasn’t requested.

> This doesn't seem to be comparing anything like the same thing.

This is a fair point, and I believe this whole series of 'Beating C with foo' posts could have been better named.

But I'm of the opinion that the whole series is about showcasing various language's strengths and weaknesses, while using GNU wc as a benchmark.

From this perspective, I've learnt a bit about several languages I knew nothing about, so I rather like these articles, despite the fact that their titles might be a bit misleading.

It's all about tone.

Everyone remembers being a brash know-nothing teenager, so just by the headline "DESTROYING C" you get that vibe.

I remember having a blog about Haskell circa 2002 where I would smugly enumerate the ways in which Guido van Rossum was wrong. "GUIDO IS WRONG! PART 4" my post titles would say.

He's not handling unicode and it's been well demonstrated in the past that unicode handling adds a lot to computation time in tests like these (it often comes up in grep benchmarks too).

I'd be interested to see the same tests but with a non-unicode local set, IIRC:

    LC_ALL=C
I'd wager the C version would perform much better in that test.
Don't forget to export the variable, or run like "LC_ALL=C wc ...", or otherwise the locale won't apply.
Does it handle multiple files, and stdin?

Anyway, destroying Haskell with 100 lines of C :-)

https://raw.githubusercontent.com/gdevic/minix1/master/comma...

I see and raise, to one line:

https://www.ioccc.org/2019/burton/prog.c

It is the Burton entry from The International Obfuscated C Code Contest, but seems like only handling ascii

https://www.ioccc.org/years-spoiler.html

Limits:

"Requires the C locale and ASCII character set. Input should be less than ten million octets to avoid this problem."

https://www.ioccc.org/2019/burton/hint.html

Isn't 100 lines too much for a simple utility like wc? I get that it has many edge cases to cover, but edge cases usually require some different handling when you run into them anyway. I'd rather use my one-liner (including calculation in a single pass, and even parallel processing!):

    wc:{sum (({1};ceiling 0.5*sum differ " "=;{1+count x})@\:) peach read0 x}
Do you get a speed-up if you use getc_unlocked() instead of getc()? And if you write your own isspace()? As far as I know, isspace() is locale sensitive.
isspace is defined as a macro at the top of the file, and it's definitely not locale sensitive ;)
Off-topic but tangentially related. There was Reddit post[0] about a clothes dryer's eco mode being a scam since it consumes more power than a regular drying cycle. Here[1] is a single comment thread noting that the user was measuring power output of the entire house. The user quickly changed their tune when this was highlighted.

[0]: https://old.reddit.com/r/dataisbeautiful/comments/eyevca/my_...

[1]: https://old.reddit.com/r/dataisbeautiful/comments/eyevca/my_...

They explicitly say in the introduction that this is a toy version, and then in the conclusion that a future post will look at a more complete substitute.
> failing to note the differences is simply dishonest.

Like yourself, I think any reader could understand that this is meant to be a toy implementation. The author even says so in the very first paragraph and at several points in the article thereafter. How is that dishonest?

Because that makes the programs sufficiently different that the conclusion about "smashing" seems dishonest.

Apples smash Oranges, etc.