

by caspper69 498 days ago

Having just gone down this road in C#, the way Unicode is now handled is via "runes".

Each rune may comprise several Unicode characters, each of which may itself be 1-4 bytes (in the case of UTF-8 encoding).

The one problem I have with this approach is that all of the categorization features operate a level below the runes, so you still have to break them up. The biggest drawback is that, at least in my (admittedly limited) research, there is no such thing as a "base" character in certain runes (such as family emojis: parents with kids). You can mostly dance around it with the vast majority of runes, because one character will clearly be the base character and one (or more) will clearly be overlays, but it's not universal.
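To make the "no base character" point concrete, here is a minimal sketch in Go rather than C#. Note the terminology shift: in Go a rune is a single code point, so the multi-character unit described above corresponds to what Unicode calls a grapheme cluster. The helper name countMarks is hypothetical, and classifying "overlays" as combining marks (category M) is an assumption made for illustration only.

```go
package main

import (
	"fmt"
	"unicode"
)

// countMarks (hypothetical helper) reports how many code points in s are
// combining marks (Unicode category M) and how many are anything else.
func countMarks(s string) (marks, others int) {
	for _, r := range s {
		if unicode.IsMark(r) {
			marks++
		} else {
			others++
		}
	}
	return marks, others
}

func main() {
	// "e" followed by U+0301 COMBINING ACUTE ACCENT: one obvious base, one overlay.
	accented := "e\u0301"
	// Man, ZWJ, woman, ZWJ, girl: three emoji joined by U+200D ZERO WIDTH JOINER.
	family := "\U0001F468\u200D\U0001F469\u200D\U0001F467"

	m, o := countMarks(accented)
	fmt.Println(m, o) // 1 1

	m, o = countMarks(family)
	fmt.Println(m, o) // 0 5
}
```

In the accented case the single non-mark code point is the natural base. In the family sequence none of the five code points is a mark, so this kind of category check gives no way to pick one of them as the base.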

2 comments

silisili 498 days ago

Go does this too. I generally like the idea a lot, as long as it's consistent. The one thing I don't like is the inconsistency.

Not sure about C#, but in Go, for example, ranging over a string iterates over runes, while indexing pulls a single byte. And len is the byte length rather than the rune count.

So basically it's a byte array everywhere except ranging. I guess I would have preferred an explicit cast or conversion to do that instead of by default.
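The difference is easy to see in a short Go sketch. The helper names byteLen, runeLen, and runeOffsets are made up for illustration; the underlying operations (len, indexing, range, and utf8.RuneCountInString) are standard Go.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteLen returns what len reports for a string: its length in bytes.
func byteLen(s string) int { return len(s) }

// runeLen returns the number of runes (code points) in s.
func runeLen(s string) int { return utf8.RuneCountInString(s) }

// runeOffsets returns the byte offset at which each rune starts,
// as produced by ranging over the string.
func runeOffsets(s string) []int {
	var offsets []int
	for i := range s {
		offsets = append(offsets, i)
	}
	return offsets
}

func main() {
	s := "h\u00e9llo" // "héllo"; U+00E9 takes two bytes in UTF-8

	fmt.Println(byteLen(s))     // 6
	fmt.Println(runeLen(s))     // 5
	fmt.Printf("%#x\n", s[1])   // 0xc3, the first byte of the two-byte rune
	fmt.Println(runeOffsets(s)) // [0 1 3 4 5]

	// The explicit conversion mentioned above: []rune makes indexing rune-based.
	r := []rune(s)
	fmt.Println(len(r), string(r[1])) // 5 é
}
```

Converting to []rune gives the explicit, consistent view, at the cost of copying the string into a new slice.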


stonogo 498 days ago

Runes are how UTF-8 has been handled since its invention. It's just taken some platforms longer to get there than others.
