| HN Mirror



	by bitwize 1431 days ago
	This never happens. In UTF-8, bit 7 is set in every byte of a valid multibyte sequence, so ASCII characters cannot appear inside one. Now, you can have an ASCII character in the middle of a multibyte sequence, but this isn't valid UTF-8 (and should be treated as an error by the decoder). "Words" generated by a splitter would also be invalid UTF-8 if they were split on a space sitting between the bytes of a multibyte sequence, but these can be detected. Word splitting will not generate invalid words from valid input, or vice versa; and if you know you have valid input, splitting on space, \t, \n, \r, etc. is perfectly cromulent in UTF-8. Now if you want to also split on U+200B (ZERO WIDTH SPACE), U+202F (NARROW NO-BREAK SPACE), etc... hoo boy.
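	To make that concrete, here is a rough Python sketch (not from the original comment; the helper names `split_utf8_words` and `is_valid_utf8` are made up for illustration). It splits raw bytes on ASCII whitespace without decoding first, then checks each piece by trying to decode it.

```python
def split_utf8_words(data: bytes) -> list[bytes]:
    """Split raw bytes on runs of ASCII whitespace, without decoding first.

    Hypothetical helper: bytes.split() with no argument splits on ASCII
    whitespace only (space, \\t, \\n, \\r, \\x0b, \\x0c).
    """
    return data.split()


def is_valid_utf8(chunk: bytes) -> bool:
    """Hypothetical helper: True if chunk decodes cleanly as UTF-8."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


text = "naïve café 日本語\tüber\nend"
raw = text.encode("utf-8")

# Every byte of a multibyte sequence has bit 7 set, so none of them can be
# mistaken for an ASCII delimiter such as 0x20 or 0x0A.
multibyte = [b for ch in text if ord(ch) > 0x7F for b in ch.encode("utf-8")]
assert all(b & 0x80 for b in multibyte)

# Valid input in, valid words out.
words = split_utf8_words(raw)
assert all(is_valid_utf8(w) for w in words)

# Invalid input: a space wedged inside the two-byte encoding of "é" (C3 A9).
# The splitter produces two broken pieces, and the decoder catches both.
broken = split_utf8_words(b"\xc3 \xa9")
assert not any(is_valid_utf8(w) for w in broken)

# The "hoo boy" case: U+200B and U+202F are multibyte in UTF-8, so a
# byte-level ASCII split never sees them and the string stays in one piece.
assert len(split_utf8_words("a\u200bb\u202fc".encode("utf-8"))) == 1
```

	Splitting on the non-ASCII separators means either decoding first or searching for their specific byte sequences, which is where it stops being a one-liner.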