by imron 3371 days ago
> So it's somehow C's fault that Unicode uses variable-length encoding

Parent said string handling in C was elegant. My point is that it becomes fraught with (even more) issues once you throw non-English text at it.

It is C's decision to handle strings in this way, and the decision of many C programmers to treat all strings as if they are just iterable character pointers.

It's a recipe for bugs.

1 comments

I am the parent you are talking about. I've made this argument many times with people: Unicode is crazy complicated in any programming language. People think that widening the char width will help. Well, you seem to be somebody who knows Unicode, so you probably know the horrors of surrogates, combining characters vs. precomposed diacritics, zero-width joiners, Han unification, variation selectors, BiDi... This is in no way just a C thing to deal with all that nonsense. I've not seen any language or library that I'd say does it "well" and saves individual programmers from considering the above. They all punt the issue to the programmer.
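To make the combining-vs-precomposed point concrete, here is a minimal sketch in C. The macro names and utf8_code_points are made up for illustration, and the counter assumes its input is already valid UTF-8. It shows that the same visible "é" can be two different byte strings, and that counting code points instead of bytes still does not make them agree.

```c
#include <stddef.h>
#include <string.h>

/* Two encodings of what a reader sees as the same "é" (illustrative names). */
#define E_ACUTE_PRECOMPOSED "\xc3\xa9"   /* U+00E9 */
#define E_ACUTE_DECOMPOSED  "e\xcc\x81"  /* U+0065 followed by U+0301 */

/* Counts code points by counting every octet that is not a UTF-8
   continuation byte (10xxxxxx). Assumes the input is valid UTF-8. */
size_t utf8_code_points(const char *s) {
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

With these definitions, strlen gives 2 and 3 bytes, utf8_code_points gives 1 and 2, and strcmp reports the two strings as different. Treating them as equal needs normalization, which neither a wider char type nor a code point counter provides.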

I've heard (mostly here) that Swift does something different and treats glyphs (strictly speaking, extended grapheme clusters) as the basic unit. I haven't had a chance to look at precisely what that does. Given all the issues I've seen elsewhere, I'm skeptical that anyone can pull that off correctly.

UTF-8 at least has one elegance (there's that word again) in its design: you can do some "dumb" ASCII things, and if your code does not know what to do with fancy Unicode, you can check the high bit of any given octet and "safely" skip over it and any adjacent non-ASCII sequence. This may or may not be applicable to the task at hand.
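A rough sketch of that high-bit trick in C, with hypothetical function names. Every octet of a multi-byte UTF-8 sequence, lead or continuation, has its top bit set, so an ASCII-only routine can step past those octets without decoding them.

```c
#include <stddef.h>

/* Nonzero if the octet is part of a multi-byte UTF-8 sequence.
   Lead and continuation bytes alike have the high bit set. */
int is_non_ascii(unsigned char c) {
    return (c & 0x80) != 0;
}

/* Returns a pointer to the first octet at or after s that is plain ASCII
   (possibly the terminating NUL), skipping a whole non-ASCII run. */
const char *skip_non_ascii_run(const char *s) {
    while (is_non_ascii((unsigned char)*s)) {
        ++s;
    }
    return s;
}

/* A "dumb" ASCII operation: uppercase a-z in place and leave every
   high-bit octet untouched, so valid UTF-8 input stays valid. */
void ascii_upper_in_place(char *s) {
    for (; *s != '\0'; ++s) {
        unsigned char c = (unsigned char)*s;
        if (is_non_ascii(c)) {
            continue;
        }
        if (c >= 'a' && c <= 'z') {
            *s = (char)(c - 'a' + 'A');
        }
    }
}
```

Run on the UTF-8 bytes of "naïve café", ascii_upper_in_place turns the ASCII letters into "NA", "VE", "CAF" and leaves the two-byte sequences for "ï" and "é" exactly as they were. It does not uppercase them, which is the "may or may not be applicable" part.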

> This is in no way just a C thing to deal with all that nonsense. I've not seen any language or library that I'd say does it "well" and saves individual programmers from considering the above.

This is true; however, even something as simple as storing the (byte) length as part of the string reduces the complexity and the likelihood of bugs.
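As a sketch of what I mean, assuming a made-up byte_str type (not from any particular library): the length travels with the data, so getting it is O(1), embedded NUL bytes don't truncate anything, and comparisons never have to scan for a terminator.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical length-carrying string, for illustration only. */
typedef struct {
    size_t len;   /* length in bytes, not code points or graphemes */
    char *data;   /* owned buffer, not necessarily NUL-terminated */
} byte_str;

/* Copies len bytes into a new byte_str. On allocation failure the result
   has data == NULL and len == 0. */
byte_str byte_str_from(const char *bytes, size_t len) {
    byte_str s = { 0, NULL };
    s.data = malloc(len ? len : 1);
    if (s.data == NULL) {
        return s;
    }
    if (len) {
        memcpy(s.data, bytes, len);
    }
    s.len = len;
    return s;
}

/* Byte-wise equality; bounded by the stored lengths, never by a NUL. */
int byte_str_eq(byte_str a, byte_str b) {
    return a.len == b.len && (a.len == 0 || memcmp(a.data, b.data, a.len) == 0);
}

void byte_str_free(byte_str *s) {
    free(s->data);
    s->data = NULL;
    s->len = 0;
}
```

For example, byte_str_from("a\0b", 3) keeps all three bytes and reports len == 3, while strlen on the same literal stops at 1. It does nothing for the Unicode problems themselves, since the length is still in bytes.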

Other languages also prevent accidental buffer overruns, so while they still need to deal with all the same Unicode problems you mentioned, the program likely won't crash if the programmer gets things wrong. The same is not necessarily true of C.
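In C you have to build that check yourself. A minimal sketch, using a hypothetical copy_checked helper: it refuses to copy when the destination is too small instead of writing past the end, which is roughly the guarantee other languages give you by default.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies src (including its NUL) into dst only if it
   fits in dst_size bytes. Returns 0 on success and -1 if it does not fit,
   in which case dst is left as an empty string when dst_size > 0. */
int copy_checked(char *dst, size_t dst_size, const char *src) {
    size_t n = strlen(src);   /* byte count, so multi-byte UTF-8 is covered */
    if (dst_size == 0) {
        return -1;
    }
    if (n >= dst_size) {
        dst[0] = '\0';
        return -1;
    }
    memcpy(dst, src, n + 1);
    return 0;
}
```

Note that "héllo" in UTF-8 is 6 bytes plus the terminator, so a buffer sized for 5 "characters" plus a NUL is too small. With the check that is a reported error; with a bare strcpy it is an overrun.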