| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by ChrisSD 1900 days ago

Other people have answered your question but I wanted to clarify one point. The word "character" here means "unicode code point". However, what the user thinks of as a single character can be made up of more than one code point. This presents a different problem and one UTF-8 itself can't help with.

The Unicode Consortium has a report on extended grapheme clusters[0] (i.e. user-perceived characters). Essentially, if you're processing some text mid stream, it might not be clear if a code point is the start of a new user-perceived character or not. So you may want to skip ahead until an unambiguous symbol boundary is reached.

[0]: https://www.unicode.org/reports/tr29/