Hacker News
by stsp 3552 days ago
Yes, rev(1) probably should handle combining characters.

But those are a property of Unicode, not UTF-8. UTF-8 encodes code points, and we often try to get away without decoding them. Of course the resulting Unicode can change its meaning but it's still valid Unicode (and valid UTF-8).
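To make the failure mode concrete, here is a rough Python sketch (not rev(1)'s actual code, which is C). The helper names are made up for illustration, and the "careful" variant only keeps combining marks attached to their base character; it is not full grapheme cluster segmentation.

```python
import unicodedata


def reverse_code_points(s: str) -> str:
    """Naive reversal: reverses the sequence of code points."""
    return s[::-1]


def reverse_keeping_marks(s: str) -> str:
    """Reverse while keeping combining marks attached to their base.

    Approximation: a character with a nonzero canonical combining
    class is glued to the character before it. This is not the full
    grapheme cluster algorithm.
    """
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch  # attach the mark to the preceding base
        else:
            clusters.append(ch)  # start a new cluster
    return "".join(reversed(clusters))


word = "cafe\u0301"  # "café" spelled as 'e' followed by U+0301
naive = reverse_code_points(word)
careful = reverse_keeping_marks(word)
```

In the naive result the accent ends up first, with no base character before it, yet the string still encodes to perfectly valid UTF-8. That is the "valid but changed meaning" case.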

In some cases we already look at Unicode properties (such as a character's column width). So perhaps we can find a nice way to fix this problem in rev(1), some day.
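As an illustration of what "looking at a property" means, here is a crude Python approximation of a column width calculation. It is only a sketch under simple assumptions (combining marks take zero columns, East Asian wide and fullwidth characters take two, everything else one); a real wcwidth(3) handles more cases, and `approx_width` is a name invented for this example.

```python
import unicodedata


def approx_width(s: str) -> int:
    """Rough display width in terminal columns (illustrative only)."""
    width = 0
    for ch in s:
        if unicodedata.combining(ch) != 0:
            continue  # combining marks occupy no column of their own
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2  # wide and fullwidth characters
        else:
            width += 1
    return width
```

For example, `approx_width("e\u0301")` counts one column even though the string holds two code points.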

There are many more interesting Unicode issues we don't address in OpenBSD's UTF-8 support (e.g. Han unification, normalization of precomposed vs. decomposed forms).
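For readers who haven't run into the precomposed vs. decomposed issue, a small Python sketch using the standard `unicodedata` module shows it. The helper `same_text` is hypothetical, and comparing under NFC is just one possible choice.

```python
import unicodedata

precomposed = "\u00e9"  # 'é' as a single code point
decomposed = "e\u0301"  # 'e' followed by combining acute accent


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


raw_equal = precomposed == decomposed  # compares code points directly
normalized_equal = same_text(precomposed, decomposed)
```

The two strings render identically but compare unequal code point by code point (and byte by byte in UTF-8) until one of them is normalized.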

But we have to start somewhere.

Perhaps, eventually, someone will specify a minimal and sane variant of Unicode, which removes all the ambiguities, edge cases, and silly symbols. We'd probably switch over in a heartbeat.

3 comments

What would a minimal and sane variant of Unicode be like? Removing the weird behaviour of Unicode would necessarily mean removing support for some characters, like those that only exist in decomposed form with combining diacritics, and some types of scripts like right-to-left. Mapping code points, characters and graphemes one-to-one seems like it would make text processing easier at the cost of excluding a large portion of the character set.
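A quick Python check of the "only exists in decomposed form" point, as a sketch: 'q' with a combining tilde is used here as an example of a sequence with no precomposed code point, so NFC normalization has nothing to compose it into.

```python
import unicodedata

e_acute = "e\u0301"  # has a precomposed form, U+00E9
q_tilde = "q\u0303"  # no precomposed form

# NFC composes where a precomposed code point exists, otherwise
# the sequence is left as base character plus combining mark.
e_nfc = unicodedata.normalize("NFC", e_acute)
q_nfc = unicodedata.normalize("NFC", q_tilde)

e_len = len(e_nfc)  # number of code points after NFC
q_len = len(q_nfc)
```

So even after normalization, one visible character can still be two code points, which is why a strict one-to-one mapping would have to drop such characters.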

I guess it would form a middle ground; US-ASCII is also a minimal subset of Unicode where text processing is easy.

Ding ding! Hard things are hard.

It seems... at least a bit arrogant for a developer that doesn't write any of the languages that rely on these features to claim that they're insane and excessive.

You'd switch, lose the ability to convert between Unicode and those random 8-bit encodings, and end up having to support the encodings yourself, because you could no longer solve that by converting to Unicode.
Once we have a minimal and sane variant of humans without ambiguities, edge cases, and silly symbols, we can get right on that.