| HN Mirror


by lor_louis 905 days ago

In UTF-8, a single byte (uint8_t) may not represent a whole "code point". A code point is an individually meaningful element of the text, like a space, an 'e', or a modifier code point such as a combining accent or a ZWJ. Most UTF-8 libraries will let you address individual code points, but you can still garble the text if you split between an 'e' and a combining '`'. To prevent this, splitting should be done between graphemes (sequences of code points that render like a single unit*). And even graphemes have their problems.
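To make the failure modes concrete, here is a minimal Python sketch. The sample string and the variable names are my own illustration, and I use an acute accent (U+0301) rather than a grave one. It only shows byte-level and code-point-level splitting going wrong; it does not attempt real grapheme segmentation.

```python
import unicodedata

# "café" written as 'e' followed by U+0301 COMBINING ACUTE ACCENT
# (hypothetical sample text, chosen only for illustration).
text = "cafe\u0301"
data = text.encode("utf-8")

byte_count = len(data)        # UTF-8 bytes: the accent alone takes two
code_point_count = len(text)  # Python's str is indexed by code point

# Cutting after 5 bytes lands inside the accent's two-byte sequence,
# so the result is no longer valid UTF-8.
try:
    data[:5].decode("utf-8")
    byte_split_ok = True
except UnicodeDecodeError:
    byte_split_ok = False

# Cutting after 4 code points is valid UTF-8, but it separates the
# 'e' from its accent and silently changes the text.
code_point_split = text[:4]

# The code point that was cut off is a combining mark (non-zero class).
dropped_is_combining = unicodedata.combining(text[4]) != 0
```

The byte split fails loudly, which is arguably the better outcome. The code point split is the nasty one because nothing complains.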

Very interesting blog post about graphemes that parallels my experience writing a terminal text editor: https://mitchellh.com/writing/grapheme-clusters-in-terminals

* Even that is not a proper description of a grapheme

And don't even get me started on regexes

1 comments

_a_a_a_ 905 days ago

Yes, I understand a little about Unicode in this kind of problem, but a code point is an individual logical item even if it is composed of multiple bytes, being a kind of 'string' in itself. I should have asked more carefully: what would be a better system in your view?

Thanks for the link, will check it out after Christmas.


lor_louis 903 days ago

I personally believe that Swift's strings, where graphemes are the smallest indexable unit, are the gold standard for writing logic that might truncate multilingual text. They're still not perfect, though: they add overhead, and updates to Unicode might change behaviour. Still, they should handle most cases gracefully.
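For languages that index by code point, a rough approximation of the same idea can be sketched by hand. The Python below is only an illustration under loud assumptions: `rough_clusters` and `truncate` are hypothetical helpers of mine, and the rule they use (attach combining marks, ZWJ, and whatever follows a ZWJ to the previous cluster) is a simplification, not the real Unicode segmentation algorithm. It ignores flags, Hangul, emoji skin tones, and plenty of other rules.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER


def rough_clusters(text):
    """Group code points into approximate grapheme clusters.

    Simplified rule (an assumption, not the Unicode algorithm): a code
    point joins the previous cluster if it is a mark (category M*),
    is a ZWJ, or directly follows a ZWJ.
    """
    clusters = []
    for ch in text:
        attach = bool(clusters) and (
            unicodedata.category(ch).startswith("M")
            or ch == ZWJ
            or clusters[-1].endswith(ZWJ)
        )
        if attach:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def truncate(text, max_clusters):
    """Keep at most max_clusters approximate clusters, never cutting inside one."""
    return "".join(rough_clusters(text)[:max_clusters])


# 'e' + combining acute stays together when truncating to 4 units.
accented = truncate("cafe\u0301s", 4)

# man + ZWJ + woman + ZWJ + girl is kept as one unit, the '!' is dropped.
family = "\U0001F468\u200d\U0001F469\u200d\U0001F467"
emoji = truncate(family + "!", 1)
```

This also shows the overhead: every truncation walks the string and consults Unicode tables, and the answer depends on which Unicode version those tables come from.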
