by ColinWright 2323 days ago
Not being great at reading Haskell, I have some questions I was hoping people here could answer:

* Does this cope with different whitespace, such as tabs?

* Does this cope with different settings of locale?

* Does this include the option of the "longest line"?

* Does this perform the character counts?

I'm pretty sure wc does all these, and that stripping them out would make it faster. If this Haskell version doesn't do that, and yet still compares against a fully-featured version of wc, the comparison hardly seems fair.

2 comments

* isSpace handles tabs, but since the code looks at a single byte at a time, it won't handle the multibyte whitespace characters you can have in Unicode. If you read further down, they rip out the remains of Unicode handling for further speed improvements.

* Looking at a single byte at a time, it presumably only handles the "C" locale :) They don't say what locale GNU wc was tested with. If it wasn't LANG=C, that benchmark should be re-run.

* --max-line-length? no. But I'm guessing GNU wc isn't benchmarked with that option on (can't find the invocation in the blog post though)

* data State { ws, bs, ls } keeps count of words, bytes (more honest than calling it characters) and lines.
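
For anyone else who doesn't read Haskell easily, here is a minimal sketch of what a byte-at-a-time counter along these lines might look like. It is not the blog post's code: the `Counts` and `St` types and the `step` and `countBytes` functions are names made up for illustration. The only things taken from the discussion above are the idea of folding over single bytes with `isSpace` and the three counters for words, bytes, and lines.

```haskell
module Main where

import qualified Data.ByteString.Char8 as BC
import Data.Char (isSpace)

-- Hypothetical record mirroring the ws/bs/ls fields mentioned above.
data Counts = Counts
  { ws :: !Int  -- words
  , bs :: !Int  -- bytes (not characters)
  , ls :: !Int  -- lines (newline bytes)
  } deriving (Eq, Show)

-- Fold state: the counts so far, plus whether the previous byte was whitespace.
data St = St !Counts !Bool

-- Process one byte. A word starts whenever a non-space byte follows a space byte.
step :: St -> Char -> St
step (St (Counts w b l) prevSpace) c = St (Counts w' (b + 1) l') sp
  where
    sp = isSpace c
    w' = if prevSpace && not sp then w + 1 else w
    l' = if c == '\n' then l + 1 else l

-- Strict left fold over the raw bytes; no decoding, no locale lookup.
countBytes :: BC.ByteString -> Counts
countBytes = unwrap . BC.foldl' step (St (Counts 0 0 0) True)
  where
    unwrap (St c _) = c

main :: IO ()
main = BC.getContents >>= print . countBytes
```

In GHCi, `countBytes (BC.pack "one\ttwo three\n")` gives `Counts {ws = 3, bs = 14, ls = 1}`, so a tab is treated as a separator. Nothing in this sketch decodes UTF-8, consults the locale, or tracks the longest line.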

Thanks for the reply ...

> ... further down, they rip out the remains of Unicode handling ...

Ah. Well, that makes it a little unfair, surely.

> Looking at a single byte at a time, it presumably only handles the "C" locale ...

Again.

> --max-line-length? no. But I'm guessing GNU wc isn't benchmarked with that option on

I wonder if wc does the work anyway, and only reports it if asked, or if it actually changes the code path if it's not needed.

So this entire post feels ... intellectually dishonest. Personally, I'm all in favour of Haskell, and I wish I had the chance to use it "in anger" rather than just doing the occasional toy thingie that I do. But this post doesn't do it or its community any favours.

Disappointing.

Yes. "Destroying C". The whole post feels like youthful bravado untempered by experience.

The source code of GNU wc:

https://github.com/coreutils/coreutils/blob/master/src/wc.c

The word counting algorithm does seem to be much more complex.