| HN Mirror


by arcticbull 2239 days ago

Go's version and the Rust version differ in yet more subtle ways. It appears that Go's "rune" type is a Code Point, but Rust's "char" type is a Unicode Scalar Value, a subset of Code Points that excludes the surrogate code points. Both versions will not work with complex Unicode input unless you perform both segmentation by Grapheme Cluster [1] and utilize a consistent Normalization [2] when comparing clusters.
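A minimal sketch of both points in Rust, using only the standard library. The helper names and the sample strings are made up for illustration; they are not from the code under discussion.

```rust
/// Returns true if `code_point` can be stored in a Rust `char`,
/// i.e. if it is a Unicode scalar value.
fn is_scalar_value(code_point: u32) -> bool {
    // `char::from_u32` returns None for surrogates (U+D800..=U+DFFF)
    // and for values above U+10FFFF.
    char::from_u32(code_point).is_some()
}

/// Number of `char`s (Unicode scalar values) in `s`.
fn scalar_count(s: &str) -> usize {
    s.chars().count()
}

// Two spellings of the same visible character (illustrative sample data):
// U+00E9 LATIN SMALL LETTER E WITH ACUTE
const PRECOMPOSED: &str = "\u{00E9}";
// 'e' followed by U+0301 COMBINING ACUTE ACCENT
const DECOMPOSED: &str = "e\u{0301}";
```

Here `is_scalar_value(0xD800)` is false, while `scalar_count` gives 1 for `PRECOMPOSED` and 2 for `DECOMPOSED`, and `PRECOMPOSED == DECOMPOSED` is false even though both render as the same letter. Comparing them as equal would need a normalization step, which the standard library does not provide.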

Unicode is hard, fams, and it's rare that anything that looks easy is actually what you want.

[1] https://unicode.org/reports/tr29/

[2] http://unicode.org/reports/tr15/

1 comments

fluffything 2239 days ago

> Both versions will not work with complex Unicode input unless you perform both segmentation by Grapheme Cluster [1] and utilize a consistent Normalization [2] when comparing clusters

Doing this is quite easy from Rust.


arcticbull 2239 days ago

There's no standard library API in Rust for either grapheme cluster segmentation or for normalization. You'd need a third-party crate, at which point it's less "easy in Rust" and more "someone did the hard work for you" because it's really not easy anywhere haha.


fluffything 2239 days ago

No, and there probably never will be a libstd implementation of it, but there is a single crate that everybody uses: unicode-segmentation [0]

> You'd need a third-party crate, at which point it's less "easy in Rust" and more "someone did the hard work for you"

"Someone did the work for you" is true of all code that you did not write yourself, independently of whether that code is easy or hard to write, or whether it is in the standard library or not.

unicode-segmentation is pretty much the only library of its kind in Rust, is super easy to discover (google "Rust grapheme cluster", "Rust unicode segmentation", etc.), and using it is as easy as just typing "cargo add unicode-segmentation".

The library is maintained by a Rust core team member, a Rust standard library team member, is used by servo and firefox, and is the only unicode segmentation library that people use.

Since many programs don't need to do any kind of unicode segmentation, making it part of the standard library sounds like a bad idea. In particular, given that unicode is a moving standard, it would mean that people stuck on old Rust toolchains (e.g. LTS linux distros) cannot create binaries that do proper unicode segmentation, which does not make sense.

The underlying problem is that many programmers do not know that they need to do unicode segmentation in the first place. Moving this into the standard library does not fix that problem either.
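As an illustration of how this goes unnoticed, here is a small standard-library-only sketch. The function name and inputs are hypothetical; it shows the naive per-`char` approach, not the unicode-segmentation API.

```rust
/// Reverses `s` one `char` (Unicode scalar value) at a time.
/// This is the naive approach: it does not keep grapheme clusters
/// together, so combining marks end up attached to the wrong letter.
fn reverse_by_chars(s: &str) -> String {
    s.chars().rev().collect()
}
```

For plain ASCII this looks correct: `reverse_by_chars("abc")` gives `"cba"`. With a decomposed accent it does not: `reverse_by_chars("e\u{0301}x")` gives `"x\u{0301}e"`, so the acute accent that belonged to the `e` now follows the `x`. Nothing in the type system flags this, which is why segmenting by grapheme cluster first has to be a deliberate choice.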

[0]: https://crates.io/crates/unicode-segmentation


jfkebwjsbx 2239 days ago

> Since many programs don't need to do any kind of unicode segmentation, making it part of the standard library sounds like a bad idea. In particular, given that unicode is a moving standard, it would mean that people stuck on old Rust toolchains (e.g. LTS linux distros) cannot create binaries that do proper unicode segmentation, which does not make sense.

That has nothing to do with it. You could still have a library that has the very latest Unicode standard support for those that need the very latest, and keep updating the stdlib one.

It does not make sense either to expect someone to use bleeding edge libraries from cargo yet use an old rustc compiler. They can easily update it if needed.


fluffything 2239 days ago

> It does not make sense either to expect someone to use bleeding edge libraries from cargo yet use an old rustc compiler.

Of course it does. Many software users are stuck on multiple-year-old toolchains for various reasons, yet these systems still need to be able to handle unicode properly.

> They can easily update it if needed.

No, they cannot. Many users are stuck on older Windows versions, Linux versions, LTS Linux versions, etc. because of their organization's or their clients' requirements.

Telling a client that you can't develop an app for them because their 2 year old Ubuntu is too old is often not a very successful business model.

> and keep updating the stdlib one.

These updates would only apply to newer Rust toolchains, that many users cannot use. Unless you are suggesting the release of patch versions for the soon to be 100 old Rust toolchains in existence every time the unicode standard is updated.

This is too much trouble and work for little gain, given that one can still use a Rust 1.0 compiler to compile the latest version of the unicode-segmentation crate without problems.


arcticbull 2239 days ago

IMO it's not as clear-cut as you make it out to be. It's a pretty arbitrary line to exclude full Unicode support from the standard library. There's a ton of stuff in libstd that could be supported as third-party crates. I don't disagree with what the Rust team has done, and I think there could be a world in which the compiler team also releases first-party crates with "enhanced" functionality beyond just libstd. I consider proper Unicode support to be a "first party" thing, but I also don't think it has to be in libstd per se, necessarily.

For the record, I also disagree with your assertion that "easily done in rust" should be extended to include "...by importing a third-party framework." In that sense anything is easy to do in any language where a third-party framework exists. I'm confident it's just as easy in Go.


jfkebwjsbx 2238 days ago

> Of course it does. Many software users are stuck on multiple-year-old toolchains for various reasons, yet these systems still need to be able to handle unicode properly.

So? Use the external library then. One thing does not preclude the other.

> No, they cannot. Many users are stuck on older Windows versions, Linux versions, LTS Linux versions, etc. because of their organization's or their clients' requirements.

I work in such an organization and no, we cannot use third-party packages. The same way we cannot update our toolchain. So in most cases the point is moot.

> These updates would only apply to newer Rust toolchains, that many users cannot use. Unless you are suggesting the release of patch versions for the soon to be 100 old Rust toolchains in existence every time the unicode standard is updated.

You can provide standard Unicode handling that is good enough for 99% of the software out there. If you need to be on the bleeding edge, then use the bleeding edge library or rustc.

It is pretty simple, actually!
