| HN Mirror



	by Conlectus 1465 days ago
	What happens when the ASCII character for space is embedded within a multibyte UTF-8 codepoint?

4 comments

kd5bjo 1465 days ago

UTF-8 is designed so that this doesn’t happen: All bytes in a multibyte codepoint have the high bit set, placing them outside of the ASCII range.


bitwize 1465 days ago

This never happens. In UTF-8, bit 7 is set in every byte of a valid multibyte sequence, so ASCII characters cannot appear inside one. Now, you can have an ASCII byte in the middle of what looks like a multibyte sequence, but that isn't valid UTF-8 (and should be treated as an error by the decoder). "Words" generated by a splitter would also contain invalid UTF-8 if they were split on a space sitting between the bytes of a multibyte sequence, but these can be detected. Word splitting will not generate invalid words from valid input, or vice versa; and if you know you have valid input, splitting on space, \t, \n, \r, etc. is perfectly cromulent in UTF-8.
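As a rough illustration of the point above, here is a minimal Python sketch. The helper names (`multibyte_bytes_have_high_bit`, `split_words`) and the sample string are made up for this example; the only library behavior relied on is that `bytes.split()` with no argument splits on ASCII whitespace.

```python
def multibyte_bytes_have_high_bit(text):
    """Return True if every byte of each multibyte character is >= 0x80."""
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1 and any(b < 0x80 for b in encoded):
            return False
    return True


def split_words(data):
    # With no argument, bytes.split() splits on runs of ASCII whitespace
    # (space, \t, \n, \r, \x0b, \x0c) and nothing else.
    return data.split()


# Hypothetical sample: valid UTF-8 containing only ASCII whitespace.
sample = "naïve café 日本語\tπ≈3.14\n"
words = split_words(sample.encode("utf-8"))

# Each piece decodes cleanly: no multibyte sequence was cut in half.
decoded = [w.decode("utf-8") for w in words]
```

Because the splitter only ever cuts at bytes below 0x80, and those bytes never occur inside a multibyte sequence, every piece of valid input is itself valid UTF-8.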

Now if you want to also split on U+200B (ZERO WIDTH SPACE), U+202F (NARROW NO-BREAK SPACE), etc... hoo boy.


anyfoo 1465 days ago

If UTF-8 allowed that, then a solution would split words wrong according to UTF-8 rules, but it still would not matter, since the problem statement says that "it's okay to only support ASCII for the whitespace handling and lowercase operation".

So if your solution gets word splitting (or lowercasing) wrong for non-ASCII input, it still gets a pass according to my reading.


Kiwikwi 1465 days ago

Such "overlong" encodings are not valid UTF-8.


tialaramex 1465 days ago

This was down-voted, which isn't really the best way to handle things that are wrong.

Your parent was wondering about a hypothetical UTF-8 sequence [all sequences in this post are hexadecimal bytes] XX 20 XX in which the byte 20, the encoding of the ASCII space character, is actually somehow part of a UTF-8 character. That's not a thing. UTF-8 is deliberately designed so that nothing like this can happen. That and several other clever properties are why UTF-8 took over the world.

Overlong sequences are things like E0 80 A0, which naively looks like a UTF-8 encoding of U+0020, the space character from ASCII, but it's not, because U+0020 is encoded as just 20. These encodings are called "overlong" because they're unnecessarily long: they transport a bunch of leading zero bits for the Unicode code point. The UTF-8 design says to reject these.
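To make the overlong example concrete, here is a small Python sketch. The variable names are illustrative only. It extracts the payload bits from E0 80 A0 by hand to show why the sequence "looks like" U+0020, then checks that Python's built-in strict UTF-8 decoder refuses it.

```python
overlong_space = bytes([0xE0, 0x80, 0xA0])

# Naive payload extraction for a 3-byte pattern 1110xxxx 10xxxxxx 10xxxxxx.
# This deliberately skips the validity checks a real decoder performs.
naive_code_point = (
    ((overlong_space[0] & 0x0F) << 12)
    | ((overlong_space[1] & 0x3F) << 6)
    | (overlong_space[2] & 0x3F)
)

# The only valid encoding of U+0020 is the single byte 20.
canonical_space = " ".encode("utf-8")

# Python's strict decoder rejects the overlong form.
try:
    overlong_space.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True
```

The naive arithmetic yields 0x20, which is exactly why a decoder that skips the overlong check would be fooled into producing a space.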

[ If you are writing a decoder you have two sane choices. If your decoder can fail, report an error, etc., you should do that. If it's not allowed to fail (or perhaps there's a flag telling you to press on anyway), each such error should produce the Unicode code point U+FFFD. U+FFFD ("The Replacement Character") is: visibly obvious (often a white question mark on a black diamond); and not an ASCII character, letter, number, punctuation, white space, a magic escape character that might have meaning to some older system, filesystem separator, placeholder or wildcard, or any such thing that could be a problem. Carry on decoding the rest of the supposed UTF-8 after emitting U+FFFD. ]
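The two choices map onto the error handlers of Python's built-in decoder, so a minimal sketch needs no hand-written decoder. The function names `decode_strict` and `decode_lenient` are invented for this example; `errors="strict"` and `errors="replace"` are standard arguments to `bytes.decode`.

```python
def decode_strict(data):
    # Choice 1: fail loudly. Raises UnicodeDecodeError on invalid input.
    return data.decode("utf-8", errors="strict")


def decode_lenient(data):
    # Choice 2: press on, emitting U+FFFD for each error and then
    # continuing with the rest of the input.
    return data.decode("utf-8", errors="replace")


# The hypothetical "XX 20 XX" case: a lead byte, an ASCII space, then a
# stray continuation byte. This is not valid UTF-8.
broken = bytes([0xC3, 0x20, 0xA9])

try:
    decode_strict(broken)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

recovered = decode_lenient(broken)
```

In the lenient result the space survives as an ordinary space, with a replacement character on either side marking the two damaged bytes.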


anonymoushn 1465 days ago

I would expect a wordcount to pass these through as-is, along with any other invalid sequences of bytes with the high bits set. Performing unexpected conversions just seems like a way to make my future self have a bad time debugging.
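One way to get that pass-through behavior, sketched in Python under the assumption that the word count works on raw bytes and never decodes them. The function name `count_words` and the sample input are hypothetical. `bytes.lower()` and `bytes.split()` only act on ASCII, which matches the "ASCII-only whitespace and lowercase" allowance quoted earlier in the thread.

```python
from collections import Counter


def count_words(data):
    # bytes.lower() changes only ASCII letters; bytes.split() splits only
    # on ASCII whitespace. All bytes >= 0x80 pass through untouched,
    # whether or not they form valid UTF-8.
    return Counter(data.lower().split())


# Hypothetical input: valid UTF-8 ("café") mixed with the invalid bytes FF FE.
raw = b"The the caf\xc3\xa9 \xff\xfe THE caf\xc3\xa9"
counts = count_words(raw)
```

The invalid bytes end up as their own "word" keyed by the original bytes, so nothing is rewritten behind the user's back.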
