by lor_louis
905 days ago
In UTF-8, a single byte (uint8_t) may not represent a whole "code point". A code point is an individually meaningful element, like a space, an 'e', or a modifier code point such as an accent or a ZWJ. Most UTF-8 libraries will let you address individual code points, but you might still garble the text if you split between an 'e' and a '`'. To prevent this, splitting should be done between graphemes (sequences of code points that render like a single unit*). And even graphemes have their problems.

Very interesting blog post about graphemes that parallels my experience writing a terminal text editor: https://mitchellh.com/writing/grapheme-clusters-in-terminals

* Even that is not a proper description of a grapheme

And don't even get me started on regexes
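To make the byte / code point / grapheme distinction concrete, here is a rough Python sketch. The `naive_clusters` helper is hypothetical and only glues combining marks onto the preceding code point; it is not the real Unicode segmentation algorithm (UAX #29) and it does not handle ZWJ sequences, so treat it as an illustration rather than something to ship.

```python
import unicodedata

# "très" written as 'e' followed by U+0300 COMBINING GRAVE ACCENT
text = "tre\u0300s"
encoded = text.encode("utf-8")

byte_count = len(encoded)       # U+0300 takes two bytes in UTF-8
code_point_count = len(text)    # Python strings index by code point

def naive_clusters(s):
    """Hypothetical helper: attach combining marks to the previous code point.

    A simplification, NOT full grapheme cluster segmentation.
    """
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch   # accent stays with its base character
        else:
            clusters.append(ch)
    return clusters

# Splitting in the middle of a multi-byte sequence is not even valid UTF-8.
try:
    encoded[:4].decode("utf-8")
    byte_split_ok = True
except UnicodeDecodeError:
    byte_split_ok = False

# Splitting between code points is valid, but orphans the accent.
left, right = text[:3], text[3:]

print(byte_count, code_point_count, len(naive_clusters(text)))
print(byte_split_ok, repr(left), repr(right))
```

For real text you would want a library that implements the full segmentation rules; the point here is only that the three counts differ for the same string.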
|
Thanks for the link, will check it out after Christmas.