by fluffything 2233 days ago
> Both versions will not work with complex Unicode input unless you perform both segmentation by Grapheme Cluster [1] and utilize a consistent Normalization [2] when comparing clusters

Doing this is quite easy from Rust.


There's no standard library API in Rust for either grapheme cluster segmentation or normalization. You'd need a third-party crate, at which point it's less "easy in Rust" and more "someone did the hard work for you", because it's really not easy anywhere, haha.
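To make the gap concrete, here is a minimal std-only sketch of what goes wrong without segmentation and normalization. The helper names (`code_point_count`, `reverse_by_code_point`) are made up for illustration, and the reversal is just one example of an operation that should work per grapheme cluster, not necessarily what the original post was doing.

```rust
/// Counts Unicode scalar values (code points), which is what `str::chars` yields.
/// This is not a grapheme cluster count.
fn code_point_count(s: &str) -> usize {
    s.chars().count()
}

/// Reverses a string code point by code point, the naive approach std allows.
fn reverse_by_code_point(s: &str) -> String {
    s.chars().rev().collect()
}

fn main() {
    // "é" written two ways: precomposed (U+00E9), and 'e' followed by
    // a combining acute accent (U+0301).
    let composed = "\u{00E9}";
    let decomposed = "e\u{0301}";

    // Both are one user-perceived character, but std sees 1 vs 2 code points.
    println!(
        "code points: {} vs {}",
        code_point_count(composed),
        code_point_count(decomposed)
    );

    // Without normalization, `==` compares the underlying bytes, so these differ.
    println!("equal without normalization: {}", composed == decomposed);

    // Reversing by code point moves the accent in front of the 'e',
    // detaching it from its base letter.
    println!("{:?}", reverse_by_code_point("ae\u{0301}"));
}
```

Getting the expected answers here (one character, equal strings, accent staying on the "e") is exactly the part that needs grapheme cluster segmentation and a consistent normalization form.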
No, and there probably never will be a libstd implementation of it, but there is a single crate that everybody uses: unicode-segmentation [0]

> You'd need a third-party crate, at which point it's less "easy in Rust" and more "someone did the hard work for you"

"Someone did the work for you" is true of all code that you did not write yourself, independently of whether that code is easy or hard to write, or whether it is in the standard library or not.

unicode-segmentation is pretty much the only library of its kind in Rust, is super easy to discover (google "Rust grapheme cluster", "Rust unicode segmentation", etc.), and using it is as easy as just typing "cargo add unicode-segmentation".

The library is maintained by a Rust core team member, a Rust standard library team member, is used by servo and firefox, and is the only unicode segmentation library that people use.

Since many programs don't need to do any kind of unicode segmentation, making it part of the standard library sounds like a bad idea. In particular, given that unicode is a moving standard, it would mean that people stuck on old Rust toolchains (e.g. LTS linux distros) cannot create binaries that do proper unicode segmentation, which does not make sense.

The underlying problem is that many programmers do not know that they need to do unicode segmentation in the first place. Moving this into the standard library does not fix that problem either.

[0]: https://crates.io/crates/unicode-segmentation

> Since many programs don't need to do any kind of unicode segmentation, making it part of the standard library sounds like a bad idea. In particular, given that unicode is a moving standard, it would mean that people stuck on old Rust toolchains (e.g. LTS linux distros) cannot create binaries that do proper unicode segmentation, which does not make sense.

That has nothing to do with it. You could still have a library that has the very latest Unicode standard support for those that need the very latest, and keep updating the stdlib one.

It does not make sense either to expect someone to use bleeding edge libraries from cargo yet use an old rustc compiler. They can easily update it if needed.

> It does not make sense either to expect someone to use bleeding edge libraries from cargo yet use an old rustc compiler.

Of course it does. Many software users are stuck on multiple-year-old toolchains for various reasons, yet these systems still need to be able to handle unicode properly.

> They can easily update it if needed.

No, they cannot. Many users are stuck on older Windows versions, Linux versions, LTS Linux versions, etc. because of their organization's or their clients' requirements.

Telling a client that you can't develop an app for them because their 2 year old Ubuntu is too old is often not a very successful business model.

> and keep updating the stdlib one.

These updates would only apply to newer Rust toolchains, that many users cannot use. Unless you are suggesting the release of patch versions for the soon to be 100 old Rust toolchains in existence every time the unicode standard is updated.

This is too much trouble and work for little gain, given that one can still use a Rust 1.0 compiler to compile the latest version of the unicode-segmentation crate without problems.

IMO it's not as clear-cut as you make it out to be. It's a pretty arbitrary line to exclude full Unicode support from the standard library. There's a ton of stuff in libstd that could be supported as third-party crates. I don't disagree with what the Rust team has done, and I think there could be a world in which the compiler team also releases first-party crates with "enhanced" functionality beyond just libstd. I consider proper Unicode support to be a "first party" thing, but I also don't think it has to be in libstd per se, necessarily.

For the record, I also disagree with your assertion that "easily done in rust" should be extended to include "...by importing a third-party framework." In that sense anything is easy to do in any language where a third-party framework exists. I'm confident it's just as easy in go.

> For the record, I also disagree with your assertion that "easily done in rust" should be extended to include "...by importing a third-party framework." In that sense anything is easy to do in any language where a third-party framework exists. I'm confident it's just as easy in go.

Have you tried doing that in C++? Doing that in a cross-platform way (or even in a single platform) is anything but easy, because you don't have a tool like cargo, you have to change your build system, do the dependency resolution manually, etc.

So no, such a library existing does not imply that using that is easy.

In Rust, you just need to write `cargo add unicode-segmentation` once in a project, and then you can directly use the library API. There is literally nothing else for you to do.

That's a pretty low barrier to entry, and something you will need to do 100s of times per project anyway, because the standard library is minimal by design.

If you prefer languages without a minimal standard library, then Rust isn't for you. Go try Python, where half of the standard library has a warning saying "deprecated: use this other better external dependency instead; adding this to the standard library for convenience was the worst idea ever and now we need to maintain all this code forever".

the point would be that the standard library should be forward compatible while crates should be backward compatible.

so that current crates work with old version of compilers/toolchains.

this applies here as each new Unicode standard requires an update of the Unicode crate. ideally the best case would be to make it so that in 20 years Rust 1.0 can still use the most updated version of Unicode segmentation. similarly to how some C libraries insist on C89 compatibility to still work on older systems.

I guess Rust would like it if this never became indispensable, but it should also be possible

> Of course it does. Many software users are stuck on multiple-year-old toolchains for various reasons, yet these systems still need to be able to handle unicode properly.

So? Use the external library then. One thing does not preclude the other.

> No, they cannot. Many users are stuck on older Windows versions, Linux versions, LTS Linux versions, etc. because of their organization's or their clients' requirements.

I work in such an organization and no, we cannot use third-party packages. The same way we cannot update our toolchain. So in most cases the point is moot.

> These updates would only apply to newer Rust toolchains, that many users cannot use. Unless you are suggesting the release of patch versions for the soon to be 100 old Rust toolchains in existence every time the unicode standard is updated.

You can provide standard Unicode handling that is good enough for 99% of the software out there. If you need to be on the bleeding edge, then use the bleeding edge library or rustc.

It is pretty simple, actually!

> So? Use the external library then. One thing does not preclude the other.

That's what everybody already does? You are proposing to, instead of doing that, move that library into the standard library where it cannot ever change.

> You can provide standard Unicode handling that is good enough for 99% software out there.

That's already in std? 99% of the code doesn't need to handle unicode grapheme clusters, because it doesn't deal with unicode at all.
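For reference, a rough sketch of the kind of Unicode handling std already covers with no extra crates: UTF-8 validation, code point iteration, character classification, and case mapping. The helper names are hypothetical and only there to keep the example testable.

```rust
/// Returns (UTF-8 byte length, code point count) for `s`.
fn byte_and_code_point_len(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

/// Checks whether raw bytes are valid UTF-8, using std only.
fn is_valid_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

fn main() {
    // "straße": six code points, seven bytes, because 'ß' takes two bytes in UTF-8.
    let word = "stra\u{00DF}e";
    println!("{:?}", byte_and_code_point_len(word)); // (7, 6)

    // Case mapping follows Unicode rules: 'ß' uppercases to "SS".
    println!("{}", word.to_uppercase()); // STRASSE

    // Character classification is Unicode-aware as well.
    println!("{}", word.chars().all(char::is_alphabetic)); // true

    // Byte sequences that are not valid UTF-8 are rejected.
    println!("{}", is_valid_utf8(&[0xFF, 0xFE])); // false
}
```

None of this needs grapheme cluster tables, which is the distinction being drawn: code point level handling lives in std, segmentation lives in a crate.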

You are suggesting moving something into standard that would make unicode software harder to update, and would make the standard library huge (>20mb larger) for all programs (the unicode tables take a lot of binary size), even those that don't use unicode, to try to solve a problem that does not exist.

> I work in such an organization and no, we cannot use third-party packages

If a Rust user cannot write `cargo add unicode-segmentation`, they have bigger problems than not being able to handle grapheme clusters. You can't run async code because you don't have an executor, you can't do http because the standard library doesn't support that, you can't solve partial differential equations, or do machine learning, or pretty much anything interesting with Rust.

That's bad for you, but the solution isn't to make Rust bad for everybody else instead.

If your organization doesn't let you use third-party packages, then write your own: that's what your organization wants you to do.

Some organizations want all code in CamelCase, so they can't use the standard library at all. But the solution isn't to make Rust case insensitive, or to provide a 2nd standard library API for those organizations.