I am the parent you are talking about. I've made this argument to people many times: Unicode is crazy complicated in any programming language. People think that widening the char type will help. Well, you seem to be somebody who knows Unicode, so you probably know the horrors of surrogates, combining characters vs. precomposed diacritics, zero-width joiners, Han unification, variation selectors, BiDi... None of that nonsense is specific to C. I've not seen any language or library that I'd say does it "well" and saves individual programmers from having to think about the above. They all punt the issue to the programmer.
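To make that concrete, here is a small sketch in Python (my choice of language for illustration, using only the standard library). The variable names are made up for the example. It shows that even when you count in full code points, which is what a 32-bit char buys you, "one character" still isn't one unit.

```python
import unicodedata

# Same visible "é", two different code point sequences.
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
combining = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT

same_without_normalizing = precomposed == combining
same_after_nfc = unicodedata.normalize("NFC", combining) == precomposed

# One emoji outside the BMP: two UTF-16 code units (a surrogate pair),
# one UTF-32 code unit.
emoji = "\U0001F600"
utf16_units = len(emoji.encode("utf-16-le")) // 2
utf32_units = len(emoji.encode("utf-32-le")) // 4

# A "family" emoji built with zero-width joiners: man + ZWJ + woman + ZWJ + girl.
# Python's len() counts code points, so a wider char does not make this "1".
family = "\U0001F468\u200d\U0001F469\u200d\U0001F467"
family_code_points = len(family)

print(same_without_normalizing, same_after_nfc)
print(utf16_units, utf32_units, family_code_points)
```

Widening the unit fixes the surrogate case and nothing else: the comparison still needs normalization, and the joined sequence is still several code points that render as one thing.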

I've heard (mostly here) that Swift does something different and treats grapheme clusters (user-perceived characters, roughly what I'd call glyphs) as the basic unit. I haven't had a chance to look at precisely what it does. Given all the issues I've seen elsewhere, I'm skeptical that anyone can pull that off correctly.

UTF-8 at least has one elegance (there's that word again) in its design: you can do some "dumb" ASCII things, and if your code does not know what to do with fancy Unicode, you can check the high bit of any given octet and "safely" skip over it and any adjacent non-ASCII sequence, since every byte of a multi-byte sequence has the high bit set. This may or may not be applicable to the task at hand.
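A rough sketch of the high-bit trick, in Python over raw bytes rather than C, just to keep it short. The function names `upper_ascii` and `drop_non_ascii` are invented for this example, and it assumes the input is valid UTF-8.

```python
def upper_ascii(data: bytes) -> bytes:
    """Uppercase a-z only; copy every other octet through untouched."""
    out = bytearray()
    for b in data:
        if b & 0x80:
            # High bit set: lead or continuation byte of a multi-byte
            # sequence. We don't know what it means, so leave it alone.
            out.append(b)
        elif 0x61 <= b <= 0x7A:
            out.append(b - 0x20)  # 'a'..'z' -> 'A'..'Z'
        else:
            out.append(b)
    return bytes(out)


def drop_non_ascii(data: bytes) -> bytes:
    """Skip every octet with the high bit set, keeping only ASCII."""
    return bytes(b for b in data if b < 0x80)


sample = "na\u00efve caf\u00e9".encode("utf-8")
print(upper_ascii(sample).decode("utf-8"))
print(drop_non_ascii(sample))
```

The first function never corrupts the text, because an ASCII byte value can't appear inside a multi-byte sequence. The second shows why "safely" is in quotes: the output is valid, but you've thrown away part of what the user wrote.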