| HN Mirror



	by Kwpolska 1617 days ago
	Why does something as basic as uppercasing a string or decoding latin1 require a third-party library? I would expect that to be part of stdlib in any language. Also, why does that third-party library come with its own string implementation? What if my dependency X uses zigstr but dependency Y prefers zig-string <https://github.com/JakubSzark/zig-string>? Basically all languages designed in the past 30 years have at least basic and correct-for-BMP Unicode support built-in/as part of stdlib. Why doesn’t Zig?

3 comments

c-cube 1617 days ago

That's not "simple". Rust also does neither of those two tasks with just the stdlib!

- latin1 is dead and should be in no stdlib in 2022
- uppercasing requires the current Unicode tables, so, a largish moving target that you probably don't want to embed in small programs.


tialaramex 1616 days ago

Latin-1 is actually the first 256 code points from Unicode. So, you can do that in Rust by casting u8 (the Latin-1 bytes) into char (Unicode scalar values). That's unintuitive perhaps because of course in C that wouldn't do anything useful since the char type isn't Unicode, but in Rust that's exactly what you wanted.
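
A minimal sketch of the cast described above, using only the Rust standard library. The function name `latin1_to_string` is made up for illustration; the only std piece it relies on is the `From<u8>` conversion for `char`.

```rust
/// Decode Latin-1 (ISO-8859-1) bytes into a UTF-8 `String`.
/// Hypothetical helper name, not part of std.
fn latin1_to_string(bytes: &[u8]) -> String {
    // Every Latin-1 byte value equals the Unicode code point of the same
    // number, so converting each u8 to a char is already a correct decode.
    bytes.iter().map(|&b| char::from(b)).collect()
}
```

Calling `latin1_to_string(&[0x63, 0x61, 0x66, 0xE9])` should give the string "café", with the single byte 0xE9 becoming U+00E9 (two bytes once stored as UTF-8).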

In this environment you might very well not need actual uppercase/lowercase but only the ASCII subset. Accordingly Rust provides that too, which is far less to carry around than the Unicode case rules. Since the ASCII case change can always be performed in situ (if you can modify the data), Rust provides that too if it's what you want.
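
Roughly what the ASCII-only variants look like, as a sketch. The wrapper names `ascii_upper` and `ascii_upper_in_place` are hypothetical; `to_ascii_uppercase` and `make_ascii_uppercase` are the std methods being wrapped.

```rust
/// Return a copy with ASCII letters uppercased; non-ASCII is left untouched.
/// Hypothetical wrapper around `str::to_ascii_uppercase`.
fn ascii_upper(s: &str) -> String {
    s.to_ascii_uppercase()
}

/// Uppercase ASCII letters in place, without allocating.
/// Hypothetical wrapper around `<[u8]>::make_ascii_uppercase`.
fn ascii_upper_in_place(buf: &mut [u8]) {
    // An ASCII case change never alters a byte's length, which is why it
    // can be done in situ on a mutable buffer.
    buf.make_ascii_uppercase();
}
```

Note that `ascii_upper("héllo")` leaves the `é` alone, since it is outside ASCII; no Unicode tables are consulted.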


cturtle 1617 days ago

Those are all valid points. At the moment I believe Zig has decided to leave full Unicode support out of std because they don't want language releases dependent on Unicode updates.


futharkshill 1617 days ago

> they don't want language releases dependent on Unicode updates.

I'm sorry, what do you mean by this?


jshier 1617 days ago

The "rules" of Unicode change over time with updates to the Unicode standard(s). One big one is the grapheme breaking algorithm, which has been updated over time to support things like the family emoji and other compositions.
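
To make the family emoji case concrete, here is a small Rust sketch using only std. The helper name `scalar_count` is hypothetical. It does not do grapheme segmentation (Rust's std has none); it only shows how many code points sit behind what a reader sees as one symbol.

```rust
/// Count Unicode scalar values in a string (not graphemes).
/// Hypothetical helper name, not part of std.
fn scalar_count(s: &str) -> usize {
    s.chars().count()
}

/// The man-woman-girl family emoji, spelled out as escapes:
/// MAN, ZERO WIDTH JOINER, WOMAN, ZERO WIDTH JOINER, GIRL.
const FAMILY: &str = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}";
```

`scalar_count(FAMILY)` is 5 and `FAMILY.len()` is 18 bytes. Treating the sequence as one user-perceived character depends on the segmentation rules, which is the part that moves with the standard.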


futharkshill 1616 days ago

That should be strictly related to the rendering


jibalt 1616 days ago

correct-for-BMP-but-not-otherwise is simply a bug (and cultural chauvinism). And almost all of such implementations aren't correct-for-BMP because uppercasing Unicode is far from "basic".
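
One small illustration of why this is not "basic", sketched in Rust with only std. The helper name `full_upper` is hypothetical; it just wraps `str::to_uppercase`, which applies the Unicode case mappings that do not depend on locale or context.

```rust
/// Uppercase using Unicode case mappings.
/// Hypothetical wrapper around `str::to_uppercase`.
fn full_upper(s: &str) -> String {
    // A single char can uppercase to several chars, so the result cannot in
    // general be produced in place or by a char-to-char table.
    s.to_uppercase()
}
```

`full_upper("straße")` yields "STRASSE": the input has six characters, the output seven, and `ß` (U+00DF) sits well inside the BMP. Locale-dependent rules, such as the Turkish dotted and dotless i, are not covered by this call at all.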
