Hacker News new | ask | show | jobs
by ColinWright 2322 days ago
Reproducing what your code does in the most simple, naive C program possible, I can beat the existing wc utility, taking only around 40% of the time that wc takes.

So until you put in locale handling, alternate line endings, option handling, and error handling, I don't see that your post is at all convincing.

Quite the opposite.

So I look forward to a Haskell version that supports everything the wc has so we can get a fair comparison.

1 comments

> So until you put in locale handling, alternate line endings, option handling, and error handling, I don't see that your post is at all convincing.

That's precisely what the second part would be about.

And, if I succeed, IMO, that's where Haskell would really shine (because composability and local reasoning), and where I would be able to claim to achieve something — the stuff in the post we're discussing is indeed trivial (I didn't want to say that in the post itself though as I think it'll look like I'm belittling the guy who did the original post), while modularizing this is _fun_.

FWIW, my version, now compiled with -O3, is 8 times faster than wc. And I haven't even tried to optimise it.

I look forward to your results.

Here's the source code for GNU wc - https://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=blob;... .

It's easy to see that it's not the result of 10+ years of low-level optimizations to eek out the most performance.

Your test code probably hits the MB_CUR_MAX>1 path at line 361. (Check your locale setting!)

The main loop is:

   402   if (!in_shift && is_basic (*p))
   403     {
   404       /* Handle most ASCII characters quickly, without calling
   405          mbrtowc().  */
   406       n = 1;
   407       wide_char = *p;
   408       wide = false;
   409     }
    ...
   443   switch (wide_char)
   444     {
   445     case '\n':
   446       lines++;
   447       FALLTHROUGH;
   448     case '\r':
   449     case '\f':
   450       if (linepos > linelength)
   451         linelength = linepos;
   452       linepos = 0;
   453       goto mb_word_separator;
   454     case '\t':
   455       linepos += 8 - (linepos % 8);
   456       goto mb_word_separator;
   457     case ' ':
   458       linepos++;
   459       FALLTHROUGH;
   460     case '\v':
   461     mb_word_separator:
   462       words += in_word;
   463       in_word = false;
   464       break;
   465     default:
   466       if (wide && iswprint (wide_char))
                  ....
   480       else if (!wide && isprint (to_uchar (*p)))
   481         {
   482           linepos++;
   483           if (isspace (to_uchar (*p)))
   484             goto mb_word_separator;
   485           in_word = true;
   486         }
   487       break;
   488     }
This is a much more complicated implementation than your code. Among other things, note how it uses isprint/iswprint on each character, and how these are locale dependent.

Even in when character = byte, the main loop uses the same logic:

   555     default:
   556       if (isprint (to_uchar (p[-1])))
   557         {
   558           linepos++;
   559           if (isspace (to_uchar (p[-1]))
   560               || isnbspace (to_uchar (p[-1])))
   561             goto word_separator;
   562           in_word = true;
   563         }
   564       break;
   565     }
Your benchmark only uses the characters:

    "\n\r ',-./0123456789ABCDEFGHIJKLMNOPQRSTUVYZabcdefghijklmnopqrstuvwxyz"
which means it comes nowhere near being a good test which verifies the two programs do the same thing.

The following should be a more difficult test set to reproduce wc's output. I create the test set with Python:

    with open("testset.dat", "wb") as f:
      for b1 in range(256):
        for b2 in range(256):
          for b3 in range(256):
            _ = f.write(bytes((b1, b2, b3)))
then print the output using two different locales:

    % env LANG=en_US.UTF-8 wc testset.dat
      196608 1523713 50331648 testset.dat
    % env LANG=C wc testset.dat
      196608 1152001 50331648 testset.dat
This is a great point about handling printable vs non-printable characters that I originally missed when I read wc code. Thank you for pointing this out!
Could you elaborate on how you read the wc code?

I ask because the essay and your comments until now show no insight from reading the code.

I find it difficult to understand how anyone could miss that (rather significant) part of the core algorithm, and then assert the differences are due to only "modulo intended Unicode space handling" and the like.

Until now I had assumed you had a lay understanding of wc, and had not read the code.

I was mostly curious about how wc handles spaces and whether ignoring non-ascii spaces brings me closer or farther from what wc does. So I focused on that, and this specific printable characters handling didn't caught my eye.

On a meta level, I wasn't even considering that the notion of a word might be different from "a sequence of characters that aren't space characters".

Live and learn indeed.

Thank you for the clarification.