Hacker News new | ask | show | jobs
by throwaway55533 724 days ago
From reading the article, how is this more efficient? Doesn't any word counting algorithm have to iterate through all the characters and count spaces?

What makes this better than the standard algorithm of

    wc = 0
    prev = null
    for ch in charStream:
        if !isSpace(ch) and isSpace(prev):
           wc += 1
        prev = ch
3 comments

Basically, it's just a lookup table for isSpace, rather than whatever logic is in the original function (which probably has conditional branches).

There's a little bit of a complication in the fact that a state machine can implicitly share parts of the computation for utf-8 encoded codepoints with shared prefixes, so instead of a 2^32-element lookup table, you only need 4 2^8-element lookup tables (and instead of 1 state, parsing codepoint, you have 4, parsing first byte, parsing second byte, etc.).

Ah. Is 8 bits really optimal? I don't know how many UTF space characters there are, I thought there were only a few. Why not a 16-bit lookup?
UTF-8 encoded codepoints can have an odd number of bytes, so processing a file 2 bytes at a time would be a little more complicated. Processing the file one codepoint at a time works because you can decode the UTF-8 stream into a stream of 32-bit codepoints before passing the codepoints to the lookup table. I suppose you could also transform from UTF-8 to UTF-16 before passing values to the lookup table.

Processing byte by byte isn't necessarily faster than processing codepoint by codepoint, or any other size. You'd need to measure performance empirically, and it probably depends on caches sizes and other factors. In theory, you could also process bit by bit — then you'd only need 32 2-element lookup tables — but that's unlikely to be efficient, since you'd need to do a lot of bit manipulation.

Edit: Upon inspection, the method I described doesn't appear to be the method used by the featured program. It still basically uses a lookup table for detecting space characters, byte by byte, but the states are not how I described. Instead of the states representing which byte of a UTF-8 encoded codepoint is being processed, and the word count being incremented upon certain transitions — a Mealy machine — the state represent the class of the codepoint last fully processed, and the count is always increased based on the current state (often by zero) — a Moore machine.

It's funny that you call that the standard algorithm, because, although it intuitively makes sense, this is the first time I've seen that word-counting algorithm.

The first implementation of wc I ever saw was the one from The C Programming Language, which has this written at the beginning of the book (except instead of "(isspace(c))" they wrote "(c == ' ' || c == '\n' || c == '\t')"):

  state = OUT;
  nl = nw = nc = 0;
  while ((c = getchar()) != EOF) {
      ++nc;
      if (c == '\n')
          ++nl;
      if (isspace(c))
          state = OUT;
      else if (state == OUT) {
          state = IN;
          ++nw;
      }
  }
You have answered your question yourself: your algorithm looks at each byte twice, not once

It's even more obvious in the UTF case where the classic implementation first looks at 1-4 byte to parse a character and only then checks if it's a space