by caspper69
498 days ago
I just went down this road in C#, where Unicode is now handled via "runes". Each rune may be comprised of various Unicode characters, which may themselves be 1-4 bytes (in the case of UTF-8 encoding). The one problem I have with this approach is that all of the categorization features operate a level below the runes, so you still have to break them up. The biggest drawback is that, at least in my (admittedly limited) research, there is no such thing as a "base" character in certain runes (such as family emojis: parents with kids). You can mostly dance around it with the vast majority of runes, because one character will clearly be the base character and one (or more) will clearly be overlays, but it's not universal.
|
Not sure about C#, but in Go, for example, ranging over a string iterates over runes, while indexing pulls a single byte, and len is the byte length rather than the rune count.
So basically it's a byte array everywhere except when ranging. I guess I would have preferred an explicit cast or conversion to get the rune behavior, instead of it being the default.
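
To make the Go behavior described above concrete, here is a minimal sketch. The `stats` helper and the `stringStats` type are hypothetical names made up for this illustration; only `len`, indexing, `range`, `[]rune(...)` and `utf8.RuneCountInString` are actual Go. The sample string is "héllo", written with an escape so that the é is unambiguously the single code point U+00E9 (2 bytes in UTF-8).

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// stringStats is a hypothetical container for this illustration only.
type stringStats struct {
	Bytes     int   // what len(s) reports
	Runes     int   // number of code points
	SecondRaw byte  // what s[1] returns (a raw byte, not a rune)
	RangeIdx  []int // byte offsets visited by "for i := range s"
}

// stats collects the byte-level and rune-level views of a string.
func stats(s string) stringStats {
	st := stringStats{
		Bytes: len(s),                    // byte length
		Runes: utf8.RuneCountInString(s), // rune count
	}
	if len(s) > 1 {
		st.SecondRaw = s[1] // indexing yields a single byte
	}
	// Ranging decodes UTF-8: i is the byte offset where each rune starts.
	for i := range s {
		st.RangeIdx = append(st.RangeIdx, i)
	}
	return st
}

func main() {
	s := "h\u00e9llo" // "héllo"

	st := stats(s)
	fmt.Println(st.Bytes)              // 6
	fmt.Println(st.Runes)              // 5
	fmt.Printf("%#x\n", st.SecondRaw)  // 0xc3, the first byte of é
	fmt.Println(st.RangeIdx)           // [0 1 3 4 5]

	// The explicit conversion: a []rune is indexed and measured by rune.
	r := []rune(s)
	fmt.Println(len(r), string(r[1])) // 5 é
}
```

Note the gap in the range offsets (1 to 3): the loop steps over both bytes of é at once, which is the implicit decoding the comment is talking about. Converting to `[]rune` up front is the explicit alternative, at the cost of allocating a copy.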