by Kwpolska 1617 days ago
Why does something as basic as uppercasing a string or decoding latin1 require a third-party library? I would expect that to be part of stdlib in any language. Also, why does that third-party library come with its own string implementation? What if my dependency X uses zigstr but dependency Y prefers zig-string <https://github.com/JakubSzark/zig-string>? Basically all languages designed in the past 30 years have at least basic and correct-for-BMP Unicode support built-in/as part of stdlib. Why doesn’t Zig?
3 comments

That's not "simple". Rust doesn't do either of those two tasks with just the stdlib either!

- latin1 is dead and should be in no stdlib in 2022
- uppercasing requires the current Unicode tables, so, a largish moving target that you probably don't want to embed in small programs.

Latin-1 is actually the first 256 code points of Unicode. So in Rust you can decode it by converting each u8 (a Latin-1 byte) into a char (a Unicode scalar value). That may be unintuitive, because in C the same cast wouldn't do anything useful since C's char type isn't Unicode, but in Rust it's exactly what you want.
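A minimal sketch of that conversion, assuming the input really is ISO 8859-1 (not Windows-1252, which assigns different characters to 0x80..0x9F). The helper name `latin1_to_string` is made up for illustration; it is not a std function.

```rust
/// Decode Latin-1 (ISO 8859-1) bytes into a UTF-8 `String`.
/// Hypothetical helper for illustration; std has no function by this name.
fn latin1_to_string(bytes: &[u8]) -> String {
    // Each Latin-1 byte has the same numeric value as its Unicode code point,
    // so the infallible `u8 -> char` conversion is the whole decoder.
    bytes.iter().map(|&b| char::from(b)).collect()
}

fn main() {
    // "café" in Latin-1: the é is the single byte 0xE9.
    let latin1: [u8; 4] = [0x63, 0x61, 0x66, 0xE9];
    let s = latin1_to_string(&latin1);
    assert_eq!(s, "café");
    // The UTF-8 result is one byte longer, because é needs two bytes there.
    assert_eq!(s.len(), 5);
    println!("{s}");
}
```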

In this environment you might very well need only ASCII case conversion rather than full Unicode uppercase/lowercase. Rust provides that too, and it is far less to carry around than the Unicode case rules. Since an ASCII case change never alters the byte length, it can always be performed in place (if you can modify the data), and Rust offers that variant as well.
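For illustration, the ASCII-only methods in Rust's std look roughly like this in use. This is only a sketch of the allocating and in-place variants; the sample strings are arbitrary.

```rust
fn main() {
    // Allocating variant: returns a new String, changing only a-z.
    let shouted = "zig-string".to_ascii_uppercase();
    assert_eq!(shouted, "ZIG-STRING");

    // In-place variant on a byte buffer: no allocation, same length.
    let mut buf = *b"hello";
    buf.make_ascii_uppercase();
    assert_eq!(&buf, b"HELLO");

    // Non-ASCII characters are left untouched, so no Unicode tables are needed.
    let mut s = String::from("café");
    s.make_ascii_uppercase();
    assert_eq!(s, "CAFé");
}
```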

Those are all valid points. At the moment I believe Zig has decided to leave full Unicode support out of std because they don't want language releases dependent on Unicode updates.

> they don't want language releases dependent on Unicode updates.

I'm sorry, what do you mean by this?

The "rules" of Unicode change over time with updates to the Unicode standard. One big one is the grapheme breaking algorithm, which has been updated over time to support things like the family emoji and other compositions.
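To make the family emoji case concrete, here is a small Rust sketch using only std. Rust's std counts bytes and scalar values but does no grapheme segmentation, so whether this sequence is treated as one user-perceived character depends on the segmentation rules an external library ships. The function name `family_emoji` is just for illustration.

```rust
/// The "family: man, woman, girl, boy" emoji, written out as its
/// underlying code points: four emoji joined by ZERO WIDTH JOINER (U+200D).
fn family_emoji() -> &'static str {
    "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"
}

fn main() {
    let family = family_emoji();
    // Seven Unicode scalar values: 4 emoji + 3 joiners.
    assert_eq!(family.chars().count(), 7);
    // 4 emoji * 4 bytes + 3 joiners * 3 bytes in UTF-8.
    assert_eq!(family.len(), 25);
    // Nothing in std says whether this is 1 "character" or 7; that answer
    // comes from the grapheme break rules of a given Unicode version.
    println!("{family}");
}
```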
That should be strictly related to the rendering.
Correct-for-BMP-but-not-otherwise is simply a bug (and cultural chauvinism). And almost all such implementations aren't even correct for the BMP, because uppercasing Unicode is far from "basic".
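One well-known example of why it isn't basic, sketched with Rust's std: ß sits inside Latin-1, let alone the BMP, yet its uppercase form is two characters, so a per-character 1:1 mapping can't be right. Rust's `to_uppercase` is also locale-independent, so this sketch says nothing about language-specific rules such as Turkish dotted and dotless i.

```rust
fn main() {
    // One character in, two characters out.
    assert_eq!("ß".to_uppercase(), "SS");
    assert_eq!('ß'.to_uppercase().count(), 2);

    // So the uppercased string can be longer than the input.
    assert_eq!("straße".to_uppercase(), "STRASSE");

    // The ASCII-only conversion simply leaves ß alone.
    assert_eq!("straße".to_ascii_uppercase(), "STRAßE");
}
```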