by masklinn 3209 days ago

> Go and Rust expose UTF-8 at the byte level. This is something of a headache and may result in invalid string slices.

Rust panics if a string slice does not fall on character boundaries. You can only slice at arbitrary offsets by first converting to raw bytes, and safe Rust will then refuse to convert an invalid byte slice back to a string (in unsafe you're obviously on your own).

Safe Rust guarantees and requires[0] that strings are valid UTF8 at all times.
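A minimal sketch of that behaviour, using only the standard library. The helper names `checked_slice` and `bytes_to_str` are made up for illustration; `str::get`, `str::as_bytes` and `std::str::from_utf8` are the actual calls doing the work.

```rust
use std::str;

/// Checked slicing: returns None instead of panicking when `start..end`
/// is out of range or does not fall on UTF-8 character boundaries.
/// (Plain `&s[start..end]` would panic in those cases.)
fn checked_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}

/// Slices the raw bytes of `s` (any in-range offsets are allowed here) and
/// then tries to turn them back into a `&str`. The conversion fails with a
/// `Utf8Error` if the bytes are not valid UTF-8.
/// Note: an out-of-range byte range still panics in this sketch.
fn bytes_to_str(s: &str, start: usize, end: usize) -> Result<&str, str::Utf8Error> {
    str::from_utf8(&s.as_bytes()[start..end])
}
```

For `"h\u{e9}llo"`, where U+00E9 occupies bytes 1..3, `checked_slice(s, 0, 2)` is `None` and `bytes_to_str(s, 0, 2)` is an `Err`, whereas `&s[0..2]` would panic.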

That aside, essentially everything you ask for is part of Swift's String; you should check it out.

> Requests for s[0] to s[N], and s[-1] to s[-N], for small N, should be handled by working forwards or backwards through the UTF-8. (Yes, you can back up by rune in UTF-8. That's one of the neat features of the representation.)

Rust does that through the `chars()` iterator[1], which yields Unicode scalar values (USVs, i.e. code points minus surrogates) and can be iterated from both ends. Sadly, unlike Swift, Rust's standard library does not ship with a grapheme cluster iterator. Happily, there is a unicode_segmentation crate[2]. Swift also uses iterators but has more of them: the default iteration works on extended grapheme clusters, and the alternate iterators work on USVs, UTF-16 code units and UTF-8 code units.
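As a rough sketch of the "s[N] and s[-N] for small N" case with `chars()`: the two helper names below are hypothetical, and the only library calls are `str::chars`, `Iterator::nth` and `Iterator::rev`.

```rust
/// n-th Unicode scalar value counted from the front (0 = first),
/// found by decoding the UTF-8 forwards.
fn nth_from_start(s: &str, n: usize) -> Option<char> {
    s.chars().nth(n)
}

/// n-th Unicode scalar value counted from the back (0 = last),
/// found by decoding the UTF-8 backwards; `chars()` is double-ended.
fn nth_from_end(s: &str, n: usize) -> Option<char> {
    s.chars().rev().nth(n)
}
```

Both are O(N) in the number of scalar values skipped, not in the length of the string. They work on scalar values, not grapheme clusters: `"e\u{301}"` (e plus a combining acute accent) is two `char`s even though it displays as one character.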

If indexing is necessary for some reason, Rust also has `char_indices()`, which yields each USV together with its byte offset in the string.
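A small sketch of how `char_indices()` can turn a scalar-value position into a byte offset that is safe to slice at. The helper names are invented for this example.

```rust
/// Byte offset at which the n-th Unicode scalar value starts,
/// or None if the string has fewer than n + 1 scalar values.
fn byte_offset_of_char(s: &str, n: usize) -> Option<usize> {
    s.char_indices().nth(n).map(|(offset, _)| offset)
}

/// The rest of the string starting at the n-th scalar value.
/// Slicing cannot panic here: offsets from `char_indices()` are
/// always on a character boundary.
fn suffix_from_char(s: &str, n: usize) -> Option<&str> {
    byte_offset_of_char(s, n).map(|offset| &s[offset..])
}
```

In `"a\u{e9}\u{20ac}z"` the scalar values start at byte offsets 0, 1, 3 and 6, so `byte_offset_of_char(s, 2)` gives `Some(3)`.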

> - Lookup functions such as "index" should return an opaque type which represents the position into that string. If such an object is used as a subscript, there's no need to build the index by rune. If you coerce this opaque type into an integer, the index table has to be built. Adding or subtracting small integers from this opaque type should be supported by working backwards or forwards in the string.

That is what Swift does. `String.index(of:String)` returns a String.Index: https://developer.apple.com/documentation/swift/string.index and String's indexed methods work on that index type. This includes "reindexing" (offsetting), which is done using `String.index(String.Index, offsetBy: String.IndexDistance)`. Furthermore, String exposes two built-in indexes, startIndex and endIndex, as well as an "indices" iterator.
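To make the opaque-index idea concrete, here is a rough Rust analogue. This is not Swift's API and not anything in Rust's standard library: `StrIndex`, `index_of`, `offset_by` and `slice_from` are hypothetical names, and the sketch assumes an index is only ever used with the string it came from.

```rust
/// Hypothetical opaque position into one particular string. It wraps a
/// byte offset that always sits on a character boundary.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct StrIndex(usize);

/// Position of the first occurrence of `needle`, as an opaque index.
fn index_of(haystack: &str, needle: &str) -> Option<StrIndex> {
    haystack.find(needle).map(StrIndex)
}

/// Moves `idx` by `delta` Unicode scalar values by walking the UTF-8
/// forwards or backwards. The end-of-string position is a valid result;
/// anything beyond either end yields None.
fn offset_by(s: &str, idx: StrIndex, delta: isize) -> Option<StrIndex> {
    let base = idx.0;
    if delta >= 0 {
        s[base..]
            .char_indices()
            .map(|(i, _)| base + i)
            .chain(std::iter::once(s.len()))
            .nth(delta as usize)
            .map(StrIndex)
    } else {
        s[..base]
            .char_indices()
            .rev()
            .nth(delta.unsigned_abs() - 1)
            .map(|(i, _)| StrIndex(i))
    }
}

/// Subscripting with the opaque index: no scalar-value counting needed.
fn slice_from(s: &str, idx: StrIndex) -> &str {
    &s[idx.0..]
}
```

Small offsets cost only as much as the number of scalar values stepped over, which is the property the quoted proposal asks for.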

> This would maintain Python's existing semantics while reducing memory consumption.

It would not maintain O(1) USV indexing (especially in the C API), which was the reason for not just switching to UTF8.

In fact, FSR strings already contain a full UTF8 representation of the string[3], which the latin1 representation can share for pure ASCII strings.

[0] a non-utf8 str is one of Rust's 10 undefined behaviours, part of the "invalid primitive values" section alongside null references or invalid booleans: https://doc.rust-lang.org/nomicon/meet-safe-and-unsafe.html

[1] https://doc.rust-lang.org/std/primitive.str.html#method.char...

[2] https://kbknapp.github.io/clap-rs/unicode_segmentation/index...

[3] https://github.com/python/cpython/blob/49b2734bf12dc1cda80fd...