by Daniel_Newby 4948 days ago
> Why do you want to count Unicode characters?

Text editing and rendering. Some parts of the system cannot simply treat Unicode text as an opaque binary hunk of information.

> Why do you care if it is fast to do so?

Efficient full text search that can ignore decorative combining characters.

4 comments

> Text editing and rendering

Unless you're working entirely in fixed-width characters (and you probably aren't, given that even fixed-width systems like terminal emulators sometimes use double-wide glyphs), you need to know the value of each character to know its width. That involves the same linear scan over the string that is required to count the characters in a variable-width encoding.
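
To make the width point concrete, here is a rough Python sketch. The function name `display_width` is made up for illustration, and the rule it uses (East Asian Wide/Fullwidth characters take two columns, combining marks take none) is a simplification of what a real terminal does; it ignores control characters, emoji sequences and so on. The point is only that every character has to be inspected, whatever the encoding.

```python
import unicodedata

def display_width(text: str) -> int:
    """Approximate column width of text; hypothetical helper, simplified rules."""
    width = 0
    for ch in text:  # one linear pass: each character's value must be examined
        if unicodedata.combining(ch):
            continue  # combining marks attach to the previous character
        # East Asian Wide (W) and Fullwidth (F) characters occupy two columns
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width
```

With these rules, three CJK ideographs measure six columns even though they are three characters, so a constant-time character count would not have given you the width anyway.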

If you implement a naïve Aho-Corasick text search over one-byte characters, it works without modification on UTF-8 text. It does not ignore combining characters, but UCS-2 also has combining characters (cf. other comments in this thread), so no matter what encoding you use, you must first normalize both the Unicode text and the search string before you compare for equivalence (or compatibility, which is a looser notion than canonical equivalence for Unicode code point sequences).
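
A minimal sketch of the "normalize first, then search" step in Python, assuming that "ignoring decorative combining characters" means decomposing to NFD and dropping every code point with a nonzero combining class. The helper names `fold_marks` and `contains_ignoring_marks` are invented here, and a plain substring test stands in for Aho-Corasick.

```python
import unicodedata

def fold_marks(text: str) -> str:
    """Decompose to NFD, then drop combining marks (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def contains_ignoring_marks(haystack: str, needle: str) -> bool:
    """Substring test after folding both sides the same way."""
    return fold_marks(needle) in fold_marks(haystack)
```

Both the precomposed and the decomposed spelling of "naïve" fold to the same string, so either one matches a search for "naive". None of this depends on whether the text was stored as UTF-8 or UCS-2.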

> Text editing and rendering. Some parts of the system cannot simply treat Unicode text as an opaque binary hunk of information.

Except these parts of the system have to work on Unicode glyphs (rendered characters), which can span multiple codepoints anyway, so counting codepoints remains pointless. The only use it has is knowing how many codepoints you have. Yay.
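
A quick illustration in Python, where `len()` on a `str` counts codepoints. The helper name `codepoint_count` is just for this example.

```python
import unicodedata

def codepoint_count(text: str) -> int:
    """Number of Unicode codepoints, which is what len() reports for str."""
    return len(text)

composed = "\u00e9"            # e-acute as a single precomposed codepoint
decomposed = "e\u0301"         # 'e' followed by COMBINING ACUTE ACCENT
flag = "\U0001F1EF\U0001F1F5"  # two regional indicators, drawn as one flag

# The first two render identically but have different codepoint counts.
counts = [codepoint_count(s) for s in (composed, decomposed, flag)]

# NFC can merge the accent pair, but no normalization form merges the flag.
flag_after_nfc = codepoint_count(unicodedata.normalize("NFC", flag))
```

Here `counts` comes out as 1, 2 and 2 for three strings that each draw as a single glyph, which is why the codepoint count tells an editor or renderer very little.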

How does fast character counting help with full-text search?

The best search algorithms can skip ahead upon a mismatch. A variable-length encoding requires branch instructions in the inner loop, leading to pipeline flushes and a potentially dramatic slowdown.

This is incorrect. Searching text with a variable-length encoding does not require extra branch instructions. If you're searching through UTF-8 text, you can just pretend it's a bunch of bytes and search through that.

This isn't counting problems with normalization, of course. You will have to put your needle and haystack both into the same normalization form before searching. But you had to do that anyway.
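
A small Python sketch of the "pretend it's a bunch of bytes" approach, assuming needle and haystack are already valid UTF-8 in the same normalization form. The function names are made up for the example. It works because UTF-8 lead bytes and continuation bytes come from disjoint ranges, so a valid needle cannot match starting in the middle of a character.

```python
def find_utf8(haystack: bytes, needle: bytes) -> int:
    """Byte offset of the first match, or -1. No decoding in the search itself."""
    return haystack.find(needle)

def byte_to_char_index(haystack: bytes, byte_offset: int) -> int:
    """Convert a match offset to a character index (only if a caller needs one)."""
    # Safe to decode the prefix: a match always starts on a character boundary.
    return len(haystack[:byte_offset].decode("utf-8"))

haystack = "na\u00efve caf\u00e9".encode("utf-8")
needle = "caf\u00e9".encode("utf-8")
offset = find_utf8(haystack, needle)
char_index = byte_to_char_index(haystack, offset)
```

The search never looks at character boundaries. Converting the byte offset to a character index is a separate linear step, and only needed if something downstream actually wants character positions.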