Hacker News new | ask | show | jobs
by asveikau 3371 days ago
I am the parent you are talking about. I've made this argument many times with people: Unicode is crazy complicated in any programming language. People think that widening the char width will help - well you seem to be somebody who knows Unicode so you probably know the horrors of surrogates, combining characters vs. pre-composed diacritics, zero-width joiners, Han unification, variation selectors, BiDi... This is in no way just a C thing to deal with all that nonsense. I've not seen any language or library that I'd say does it "well" and saves individual programmers from considering the above. They all punt the issue to the programmer.

I've heard (mostly here) that Swift does something different and treats glyphs as the basic unit. I haven't had a chance to look at precisely what that does. Given all the issues I've seen elsewhere I'm skeptical that someone, anyone can pull that off correctly.

UTF-8 at least has one elegance (there's that word again) in the design in that you can do some "dumb" ASCII things and if your code does not know what to do with fancy unicode, you can check the high bit of any given octet and "safely" skip over it and any adjacent nonascii sequence if you don't know what it means. This may or may not be applicable to a task at hand.

1 comments

> This is in no way just a C thing to deal with all that nonsense. I've not seen any language or library that I'd say does it "well" and saves individual programmers from considering the above.

This is true, however even something as simple as storing the (byte) length as part of the string reduces the complexity and the likelihood for bugs.

Other languages also prevent accidental buffer overruns so while they still need to deal with all the same Unicode problems you mentioned, the program likely won't crash if the programmer gets things wrong. The same is not necessarily true of C.