Hacker News
by MichaelGG 2948 days ago
Fast checking is really useful in things like HTTP/SIP parsing. Rust should expose such a function as well seeing as their strings must be UTF-8 validated. It's even faster if you can avoid UTF-8 strings altogether and work only on a few known ASCII bytes, though that means you might push garbage further down the line.
2 comments

> Rust should expose such a function as well seeing as their strings must be UTF-8 validated.

That's more or less what std::str::from_utf8 is: it runs UTF-8 validation on the input slice and, if the slice is valid, just casts it to a &str: https://doc.rust-lang.org/src/core/str/mod.rs.html#332-335
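To make that concrete, here is a minimal sketch of leaning on std::str::from_utf8 for the validation step. The helper names `parse_token` and `valid_prefix_len` are made up for illustration; only `from_utf8` and `Utf8Error::valid_up_to` come from the standard library.

```rust
use std::str;

/// Hypothetical helper: validate a byte slice and borrow it as `&str`.
/// No copy is made; the returned `&str` points at the same bytes.
fn parse_token(bytes: &[u8]) -> Option<&str> {
    match str::from_utf8(bytes) {
        Ok(s) => Some(s),
        Err(_) => None,
    }
}

/// Hypothetical helper: length in bytes of the longest valid UTF-8 prefix.
fn valid_prefix_len(bytes: &[u8]) -> usize {
    match str::from_utf8(bytes) {
        Ok(s) => s.len(),
        // `valid_up_to` reports how many leading bytes passed validation.
        Err(e) => e.valid_up_to(),
    }
}
```

Calling `parse_token(b"INVITE")` gives back the borrowed string, while any input containing a stray 0xFF byte is rejected.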

from_utf8_unchecked is nothing more than an unsafe (C-style) cast: https://doc.rust-lang.org/src/core/str/mod.rs.html#437 and so should be a no-op at runtime.
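A rough sketch of where the unchecked cast is defensible: when you have already established, by some cheaper check of your own, that the bytes are valid. The function `digits_as_str` is hypothetical and only there to illustrate the pattern.

```rust
use std::str;

/// Hypothetical helper: accept only ASCII digits (e.g. a port number
/// field) and reinterpret the same bytes as `&str` without re-validating.
fn digits_as_str(bytes: &[u8]) -> Option<&str> {
    if !bytes.is_empty() && bytes.iter().all(|b| b.is_ascii_digit()) {
        // SAFETY: ASCII digits are single-byte code points, so the slice
        // is well-formed UTF-8 and the cast cannot produce an invalid str.
        Some(unsafe { str::from_utf8_unchecked(bytes) })
    } else {
        None
    }
}
```

The returned `&str` has the same pointer and length as the input slice, which is what you would expect from a plain cast.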

I meant Rust should have a SIMD optimised version that assumes mostly ASCII. I'm guessing there is a trade-off involved depending on the content of the string.
The linked implementation assumes mostly ASCII. It doesn't use SIMD. SIMD in Rust is a work in progress - the next version (1.27) will stabilize x86-64-specific SIMD intrinsics. There's an RFC (https://github.com/rust-lang/rfcs/pull/2366) for portable SIMD types.
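For a feel of what the x86-64 intrinsics allow, here is a hedged sketch of an all-ASCII check using SSE2 from std::arch. This is not what the standard library does; `is_ascii_simd` is a made-up name, and on other architectures the sketch simply falls back to the slice method `is_ascii`.

```rust
/// Hypothetical SSE2 check: true when every byte is below 0x80.
#[cfg(target_arch = "x86_64")]
fn is_ascii_simd(bytes: &[u8]) -> bool {
    use std::arch::x86_64::{__m128i, _mm_loadu_si128, _mm_movemask_epi8};

    let mut chunks = bytes.chunks_exact(16);
    for chunk in &mut chunks {
        // SAFETY: SSE2 is part of the x86-64 baseline, `chunk` is exactly
        // 16 bytes long, and `_mm_loadu_si128` permits unaligned loads.
        let mask = unsafe {
            let v = _mm_loadu_si128(chunk.as_ptr() as *const __m128i);
            // Collects the top bit of each of the 16 bytes into an integer.
            _mm_movemask_epi8(v)
        };
        if mask != 0 {
            return false;
        }
    }
    // Fewer than 16 bytes left over: check them one at a time.
    chunks.remainder().iter().all(|&b| b < 0x80)
}

/// Fallback for other targets, using the standard slice method.
#[cfg(not(target_arch = "x86_64"))]
fn is_ascii_simd(bytes: &[u8]) -> bool {
    bytes.is_ascii()
}
```

This only answers "is it all ASCII"; anything that fails the check would still have to go through full UTF-8 validation.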
SIMD support in Rust was only recently accepted and is still being implemented, so validation currently relies on the vectorisation abilities of the compiler (it might get revisited soon-ish I guess).

As for the assumption of mostly-ASCII, the validation function has a "striding" fast path for ASCII which checks 2 words at a time (so 128 bits per iteration on 64-bit platforms) until it finds a non-ASCII byte: https://doc.rust-lang.org/src/core/str/mod.rs.html#1541
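The striding idea can be sketched roughly like this. It is a simplification, not the std code: it uses fixed `u64` words rather than `usize`, copies bytes into arrays instead of doing aligned reads, and the name `first_non_ascii` is invented here.

```rust
/// A mask with the top bit of every byte in a 64-bit word set.
const HIGH_BITS: u64 = 0x8080_8080_8080_8080;

/// Hypothetical helper: index of the first non-ASCII byte, if any.
fn first_non_ascii(bytes: &[u8]) -> Option<usize> {
    let mut i = 0;
    // Fast path: look at two 8-byte words (128 bits) per iteration.
    while i + 16 <= bytes.len() {
        let mut a = [0u8; 8];
        let mut b = [0u8; 8];
        a.copy_from_slice(&bytes[i..i + 8]);
        b.copy_from_slice(&bytes[i + 8..i + 16]);
        let w0 = u64::from_ne_bytes(a);
        let w1 = u64::from_ne_bytes(b);
        // Any byte >= 0x80 leaves its top bit set after OR-ing the words.
        if (w0 | w1) & HIGH_BITS != 0 {
            break;
        }
        i += 16;
    }
    // Slow path: scan byte by byte to locate the exact position,
    // which also covers the tail shorter than 16 bytes.
    while i < bytes.len() {
        if bytes[i] >= 0x80 {
            return Some(i);
        }
        i += 1;
    }
    None
}
```

A full validator would switch to decoding multi-byte sequences at the returned index rather than stopping there.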

std::str::from_utf8 is that API.

Rust’s current implementation of full validation: https://github.com/rust-lang/rust/blob/2a3f5367a23a769a068c3...

I have a vague feeling there’s an even faster path for probably-ASCII out there, but I can’t immediately recall where and am going to bed. Doubtless someone else will answer before I get up.

The core team will be amenable to replacing this algorithm with something faster, provided it’s still correct.

Given the new SIMD features, it's probably time to revisit that now.
There is probably a trade-off depending on the content of the string, right? So the API probably needs a general-purpose and a "this should be all ASCII" version?
I don't know if it really makes a lot of sense to have such a specialized version of a method in the standard library. Effectively, from_str and from_str_mostlyascii would be functionally identical except for a small performance difference that depends on the encoding of the string data, and that difference wouldn't matter in 99% of programs.
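To illustrate the "functionally identical" point, a hypothetical mostly-ASCII variant might look like the sketch below. The name `from_utf8_mostly_ascii` is invented; it returns exactly what std::str::from_utf8 returns for every input and differs only in which path does the work.

```rust
use std::str::{self, Utf8Error};

/// Hypothetical variant tuned for input that is expected to be ASCII.
fn from_utf8_mostly_ascii(bytes: &[u8]) -> Result<&str, Utf8Error> {
    if bytes.is_ascii() {
        // SAFETY: ASCII is a subset of UTF-8, so no further checks are needed.
        Ok(unsafe { str::from_utf8_unchecked(bytes) })
    } else {
        // Anything else goes through the general-purpose validator.
        str::from_utf8(bytes)
    }
}
```

Whether the extra pass over non-ASCII input pays for itself depends on the data, which is the trade-off being discussed.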

Having that as a third-party crate would make complete sense, however.

If there's a canonical, obvious, and low maintenance way to do something that there's a common need for, then there's no reason it shouldn't be in std. If it needs to be hacked at and redesigned every few years, then yeah, leave it out.
That's not really the Rust way; for instance, things like timekeeping and random number generation are delegated to external crates. Given how trivial it is to add dependencies thanks to cargo, it's not really an issue in practice, even if I do think that they go a bit overboard at times (time really ought to be in std IMO, especially since there's already a rather useless std::time module).