Hacker News new | ask | show | jobs
by Animats 3209 days ago
Python took the obvious approach - they already had UTF-16 and UTF-32 builds, so this was just making that mechanism dynamic.

Go and Rust expose UTF-8 at the byte level. This is something of a headache and may result in invalid string slices. It basically punts the problem back to the user.

Here's an alternative: Use UTF-8 as the internal representation, but don't expose it to the user.

If you're iterating over a string one rune or one grapheme at a time, the UTF-8 substructure is hidden from the user. Only if the user uses an explicit numeric subscript do you need to know a rune's position in string. When a request by subscript comes in, scan the string and build an index of rune subscript->byte position. This is expensive, but no worse than UTF-32 in space usage or expansion to UTF-32 in time.

Optimizations:

- Requests for s[0] to s[N], and s[-1] to s[-N], for small N, should be handled by working forwards or backwards through the UTF-8. (Yes, you can back up by rune in UTF-8. That's one of the neat features of the representation.)

- Lookup functions such as "index" should return an opaque type which represents the position into that string. If such an object is used as a subscript, there's no need to build the index by rune. If you coerce this opaque type into an integer, the index table has to be built. Adding or subtracting small integers from this opaque type should be supported by working backwards or forwards in the string.

- Regular expression processing has to be UTF-8 aware. It shouldn't need an index by rune.

This would maintain Python's existing semantics while reducing memory consumption.

Some performance measurement tool that finds all the places where an index by rune has to be built is useful. It's rare that you really need this, but sometimes you do.

5 comments

> Go and Rust expose UTF-8 at the byte level. This is something of a headache and may result in invalid string slices. It basically punts the problem back to the user.

In Go, yes. In Rust, no. UTF-8 in Go is garbage in, garbage out. Rust, however, won't let you materialize an invalid &str without "unsafe".

The difference is that Go expects your majority use case to be copy or concatenate. If you're taking a string sequence value you're normally not going to change it, or you're going to combine it together with something else. If you have valid UTF-8 input, you should get output that is valid, but might not be 'normalized' to a single form. IF you care about normalizing you can decide when to do that (usually in output construction).

If you need to make a decision based on the content of a string, then you often need to make a normalized (the same way for both) copy the inputs.

Most importantly, if you feed in garbage, you get out the SAME garbage. The real world, and historical data, are messy. Trying to be smart can often lead to the most disastrous consequences. Being conservative and tolerant allows for intentional planning to handle the conversion at the source, if and when desired.

> Go and Rust expose UTF-8 at the byte level.

Or you can take the C++/C approach and have a character 1 byte, 2 bytes, or a multi-byte. It's a pain in the ass to constantly in C/C++ having to interface between two libraries that one decided to use char and another w_char!

The way the C and C++ committees approach Unicode is even worse than Python breaking away from UTF-16 in the wrong direction (UTF-32 being the wrong direction and UTF-8 being the right direction).

The first rule of reasonably happy C and C++ Unicode programming is not to use wchar_t for any purpose other than immediate interaction with the Win32 API.

The second rule of reasonably happy C and C++ Unicode programming is not to use the standard library facilities (which depend on the execution environment) for text processing but using some other library where the UTF-* interpretation of inputs and outputs doesn't shift depending on the execution environment or compilation environment.

This is where we run into a little bit of a problem. You have a char pointer that can be either a multiple byte encoded (depending on the code page window is using). It also can be UTF-8 encoded. Then when you move onto windows wchar_t that is originally defined as (UCS-2) then was later renamed to UTF-16, due to surrogate pair's.

So in the windows world with COM/DCOM you're basically nugged into using UTF-16 wchar_t or it becomes a hell of a lot of pain. So it is easier just simply to accept to use UTF-16 and do all the conversion from UTF-8, UTF-32, code pages to a single encoding standard.

You could just wrap that pointer in a class that describes what it is - ideally at the type level (Utf8String, etc). Each string class knows how to convert from other string types, and any library calls get wrapped in a method that is either templated on the string input type(s) or takes a BaseString* and calls virtual conversion functions. Or force a manual call to convert each time so that your fellow developers know when slow conversions are happening for sure.

It is a crappy situation though. Pick where you want your pain point to be.

See http://utf8everywhere.org

It talks about Windows quite a bit.

> Go and Rust expose UTF-8 at the byte level. This is something of a headache and may result in invalid string slices. It basically punts the problem back to the user.

There are pros and cons to both approaches. The prime ones being that []byte allows for easy random access, whereas []rune usually takes O(n) to work with (unless you store rune lengths separately, which is memory intensive).

I guess it's about the right level of abstraction, so that you can choose if you're working with bytes (binary I/O, when you know it's ascii etc.) and when with runes (most situations).

I still haven't decided whether I prefer the Python approach or the Go one.

If you look over Elixir's doc [0] on binary strings, they take the best of both worlds. The APIs are specifically crafted for least surprise, e.g. with `String.length()`, `byte_size`, `Strings.graphemes()` and `String.codepoints()` functions.

[0] https://hexdocs.pm/elixir/String.html

I think the devil is in the details of that opaque 'position' type.

With integers you can do things like concatenate two strings and adjust the indexes referring to the second string by adding the length of the first one. If you invent a new position type you have to add support for several things like this.

In any case I think the Python people were right to carry on using integers.

That would force a conversion from opaque type to integer, which would force creation of the rune to byte index. It doesn't have to be handled as a special case. The opaque type thing is an optimization hidden from the user. If you try to look at it, you get the integer value, expensively.
It's a shame to do all that work at runtime though, when the result is just going to be that the byte number in the opaque value is increased by the byte length of the first string.

I do think there's a great deal to be said for indexing and pattern matching returning opaque 'locations' (particularly if/when we have languages that let you verify at compile time that you're using the location with the right string).

But I fear doing it well would be distinctly fiddly.

> Go and Rust expose UTF-8 at the byte level. This is something of a headache and may result in invalid string slices.

Rust will panic on invalid slices unless you first convert to raw bytes, and then it will not allow converting invalid slices back to a string in safe rust (in unsafe you're obviously on your own).

Safe Rust guarantees and requires[0] that strings are valid UTF8 at all times.

That aside, essentially all of your desires are part of Swift's string, you should check them out.

> Requests for s[0] to s[N], and s[-1] to s[-N], for small N, should be handled by working forwards or backwards through the UTF-8. (Yes, you can back up by rune in UTF-8. That's one of the neat features of the representation.)

Rust does that through the `chars()` iterator[1] which iterates through USVs (codepoints) and can be iterated from both ends. Sadly unlike Swift it does not ship with a grapheme cluster iterator. Happily there is a unicode_segmentation crate[2]. Swift also uses iterators but has more of them: the default iteration works on extended grapheme clusters, and alternate iterators are USV, UTF-16 and UTF-8.

If indexing is necessary for some reason Rust also has char_indices() which iterates on the USV and its (byte) position in the string.

> - Lookup functions such as "index" should return an opaque type which represents the position into that string. If such an object is used as a subscript, there's no need to build the index by rune. If you coerce this opaque type into an integer, the index table has to be built. Adding or subtracting small integers from this opaque type should be supported by working backwards or forwards in the string.

That is what Swift does. `String.index(of:String)` will return a String.Index: https://developer.apple.com/documentation/swift/string.index and indexed String methods will work based on that index type. This includes "reindexing" (offsetting) which is done using String.index(String.Index, offsetBy: String.IndexDistance). Furthermore String exposes two built-in indexes startIndex and endIndex as well as an "indices" iterator.

> This would maintain Python's existing semantics while reducing memory consumption.

It would not maintain O(1) USV indexing (especially in the C API), which was the reason for not just switching to UTF8.

In fact, FSR strings already contain a full UTF8 representation of the string[3], which the latin1 representation can share for pure ASCII strings.

[0] a non-utf8 str is one of Rust's 10 undefined behaviours, part of the "invalid primitive values" section alongside null references or invalid booleans: https://doc.rust-lang.org/nomicon/meet-safe-and-unsafe.html

[1] https://doc.rust-lang.org/std/primitive.str.html#method.char...

[2] https://kbknapp.github.io/clap-rs/unicode_segmentation/index...

[3] https://github.com/python/cpython/blob/49b2734bf12dc1cda80fd...