Hacker News
by jesuscyborg 2039 days ago
Use wcspbrk.

UTF-8 continuation bytes are limited to the range \200 through \277, and every lead byte of a multibyte sequence is \302 or higher, so there's basically zero chance that if you choose something like comma as your delimiter that it's going to tokenize the middle of a multibyte sequence.
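To make that concrete, here is a minimal sketch of byte-oriented splitting on a comma. The helper name `utf8_field` is made up for illustration; it assumes the input is valid UTF-8 and the delimiter is plain ASCII, and it uses only `strcspn` and `memcpy` from `<string.h>`.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies the n-th (0-based) comma-separated field of
   the UTF-8 string s into out (capacity cap, including the terminator).
   Returns the field's length in bytes, or (size_t)-1 if there is no such
   field or it does not fit. Works on raw bytes; no decoding is needed
   because ',' (0x2C) never appears inside a multibyte sequence. */
size_t utf8_field(const char *s, size_t n, char *out, size_t cap)
{
    for (;;) {
        size_t len = strcspn(s, ",");   /* bytes up to the next ',' or NUL */
        if (n == 0) {
            if (len + 1 > cap)
                return (size_t)-1;
            memcpy(out, s, len);
            out[len] = '\0';
            return len;
        }
        if (s[len] == '\0')
            return (size_t)-1;          /* ran out of fields */
        s += len + 1;                   /* skip past the delimiter */
        n--;
    }
}
```

Calling it on a string whose fields contain two-, three-, and four-byte sequences returns each field with its multibyte characters intact.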

Also take into consideration that, under the hood, functions like strpbrk() are typically accelerated by CPU instructions such as PCMPISTRI, which compares 8-bit or 16-bit elements: it doesn't support UTF-8 natively, but it does support UCS-2.


> so there's basically zero chance that if you choose something like comma as your delimiter that it's going to tokenize the middle of a multibyte sequence.

Not just "basically": no byte of a valid multibyte sequence falls in the ASCII range, so there is no possible collision between ASCII characters and any valid multibyte encoding. This can be seen somewhat visually in this table[1] and is an intentional aspect of the UTF-8 design.

[1]: https://en.wikipedia.org/wiki/UTF-8#Encoding
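
The same property can be checked by brute force. This is a sketch only: `utf8_encode` and `multibyte_never_hits_ascii` are names invented here, and the encoder follows the bit layout from the table linked above.

```c
#include <stdint.h>

/* Encodes code point cp as UTF-8 into out. Returns the byte count (1-4),
   or 0 for surrogates (U+D800..U+DFFF) and values above U+10FFFF. */
int utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Returns 1 if every byte of every multibyte encoding (U+0080 and up)
   has its high bit set, i.e. can never equal an ASCII byte. */
int multibyte_never_hits_ascii(void)
{
    unsigned char buf[4];
    for (uint32_t cp = 0x80; cp <= 0x10FFFF; cp++) {
        int n = utf8_encode(cp, buf);
        for (int i = 0; i < n; i++)
            if (buf[i] < 0x80)
                return 0;
    }
    return 1;
}
```

Every lead byte carries a `11xxxxxx` prefix and every continuation byte a `10xxxxxx` prefix, so the exhaustive check cannot find a counterexample.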

How about with joiners and combining characters? E.g., if you encode é as U+0065, U+0301 (\x65\xcc\x81), then search for 'e' and act on the result somehow, you fail to consider the whole glyph.

Sure. You're talking about glyphs that are composed of multiple Unicode codepoints; my earlier comment is true of single codepoints only. The comment I was responding to is also talking only about single codepoints (wcspbrk cannot represent delimiters longer than a single codepoint).
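
For what it's worth, the é case above can be sketched in a few lines. This is an illustration under narrow assumptions: the made-up helper `find_bare_char` only recognizes the Combining Diacritical Marks block (U+0300 through U+036F, encoded as \xcc\x80..\xcc\xbf and \xcd\x80..\xcd\xaf). Proper grapheme segmentation needs real Unicode tables, e.g. from a library such as ICU.

```c
#include <string.h>

/* 1 if p points at the UTF-8 encoding of a code point in U+0300..U+036F.
   Safe on NUL-terminated input: p[1] is read only when p[0] is nonzero. */
static int starts_with_combining_mark(const char *p)
{
    unsigned char b0 = (unsigned char)p[0];
    unsigned char b1;

    if (b0 != 0xCC && b0 != 0xCD)
        return 0;
    b1 = (unsigned char)p[1];
    if (b0 == 0xCC)
        return b1 >= 0x80 && b1 <= 0xBF;   /* U+0300..U+033F */
    return b1 >= 0x80 && b1 <= 0xAF;       /* U+0340..U+036F */
}

/* Hypothetical helper: like strchr for an ASCII character c, but skips
   matches that are immediately followed by a combining diacritical mark. */
const char *find_bare_char(const char *s, char c)
{
    const char *p = s;

    if (c == '\0')
        return NULL;
    while ((p = strchr(p, c)) != NULL) {
        if (!starts_with_combining_mark(p + 1))
            return p;
        p++;
    }
    return NULL;
}
```

On the string \x65\xcc\x81 followed by "te", plain strchr matches the first byte, while this helper skips it and returns the final, unaccented 'e'.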

On joiners / combining characters: I'd encourage using composed normalization (NFC) rather than decomposed normalization (NFD).

Just out of curiosity: are there any glyphs that lack a single-codepoint representation, where one of the joined codepoints is an ASCII character? (That only helps after normalization, of course.)

Yes. ASCII uses \b as the combining character mark, which is a convention that's always been widely supported by typesetting programs such as less and nroff. For example, A\b_ is A̲, and you can do the same thing with apostrophe and tilde for accent marks. There are also Unicode emoji where two codepoints in sequence get joined together as a single glyph. Never underestimate the creative ways text can be used, or that standards just codify a long history of practices.

Er, I was asking about Unicode joining, not this roff \b thing. Sorry for the confusion. I'm aware that multiple-codepoint Unicode glyphs exist; I'm asking if any of those involve a codepoint in the ASCII (1-127) range which cannot be normalized to a single codepoint (e.g., e + ' normalizes to a single codepoint é).

Of course. Take for example mͫ (m+m): there's no way to represent that as a single codepoint. Combining marks can also be overlaid multiple times, e.g. m͚ͫ (m+m+∞), so the number of glyphs you can create is limitless. Only a tiny fraction of the possible combinations have a shorter normalized form. The new Unicode combining marks work by almost exactly the same principles as the \b ASCII combining mark. That's why I mentioned it earlier.

In what data format or programming language is 'e' a delimiter? One situation is floating-point constants, where 'e' is a delimiter indicating the exponent. However, if an é occurs in the middle of such a constant, whether as a single code point or a combined character, that is an error. The 'e' must be followed by an optional sign and one or more decimal digits.
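
A quick sketch of how that plays out with `strtod` from `<stdlib.h>`, assuming the default "C" locale. The wrapper name `parse_float_prefix` is invented here; it just reports how many bytes `strtod` consumed, so the caller can see where the constant ended.

```c
#include <stdlib.h>

/* Hypothetical wrapper: parses a leading floating-point constant from s,
   stores the value in *out, and returns the number of bytes consumed
   (0 if s does not start with a number). */
size_t parse_float_prefix(const char *s, double *out)
{
    char *end;

    *out = strtod(s, &end);
    return (size_t)(end - s);
}
```

Given "1.5e3" it consumes all five bytes. Given "1.5" followed by either form of é and then "3", it consumes only "1.5": an 'e' that is not followed by an optional sign and digits is not part of the constant, so the leftover input is what signals the error to the caller.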

The ISO C library string handling stuff is for systems programming, not for scanners and parsers for natural written language.