| Yes, rev(1) probably should handle combining characters. But those are a property of Unicode, not UTF-8. UTF-8 encodes code points, and we often try to get away without decoding them. Of course the reversed text can change its meaning, but it's still valid Unicode (and valid UTF-8). In some cases we already look at Unicode properties (such as a character's column width). So perhaps we can find a nice way to fix this problem in rev(1), some day. There are many more interesting Unicode issues we don't address in OpenBSD's UTF-8 support (e.g. Han unification, precomposed vs. decomposed normalization). But we have to start somewhere. Perhaps, eventually, someone will specify a minimal and sane variant of Unicode, which removes all the ambiguities, edge cases, and silly symbols. We'd probably switch over in a heartbeat. |
I guess it would form a middle ground; US-ASCII is also a minimal subset of Unicode where text processing is easy.
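
To make the rev(1) point from the quote concrete, here is a rough Python sketch (not OpenBSD's code, which is C). The function names are made up for illustration. The "aware" version only uses the canonical combining class from `unicodedata`, so it is an approximation of proper grapheme handling and will miss things like ZWJ emoji sequences.

```python
import unicodedata

def rev_code_points(s):
    """Reverse code point by code point (hypothetical helper)."""
    return s[::-1]

def rev_combining_aware(s):
    """Reverse while keeping combining marks attached to their base character.

    Assumption: a nonzero canonical combining class is a good enough test.
    That is a simplification, not full grapheme cluster segmentation.
    """
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch      # attach the mark to the preceding base
        else:
            clusters.append(ch)     # start a new cluster
    return "".join(reversed(clusters))

text = "ae\u0301x"                  # 'a', 'e', COMBINING ACUTE ACCENT, 'x'
naive = rev_code_points(text)       # accent ends up on the 'x'
aware = rev_combining_aware(text)   # accent stays on the 'e'

# Both results are still valid UTF-8; only the meaning differs.
naive.encode("utf-8")
aware.encode("utf-8")
```

On pure US-ASCII input the two functions give the same result, which is the "text processing is easy" case.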