|
|
|
|
|
by throwaway55533
724 days ago
|
|
From reading the article, how is this more efficient? Doesn't any word counting algorithm have to iterate through all the characters and count spaces? What makes this better than the standard algorithm of wc = 0
prev = null
for ch in charStream:
if !isSpace(ch) and isSpace(prev):
wc += 1
prev = ch
|
|
There's a little bit of a complication in the fact that a state machine can implicitly share parts of the computation for utf-8 encoded codepoints with shared prefixes, so instead of a 2^32-element lookup table, you only need 4 2^8-element lookup tables (and instead of 1 state, parsing codepoint, you have 4, parsing first byte, parsing second byte, etc.).