by zigzigzag 3293 days ago
That seems like a really unfortunate design decision. I used to think that Java's use of UTF-16 for strings was just a problematic legacy thing, but compared to this it seems quite good. Strings are pretty high performance and there are no complex calculations to do for indexing or bounds checks. And in Java 9 the JVM can switch between UTF-16 and Latin-1 encodings on the fly, which both uses less RAM and speeds things up. There are no memory safety issues caused by character encodings.

I think there might be a few misconceptions in your comment, but it's hard to say for sure. This article isn't written in a style that I particularly enjoy, but it's not too long and contains some useful facts that might help dispel those misconceptions: http://utf8everywhere.org

I do have my own unrelated niche complaints about Rust's string story (and I have vague plans to resolve them), but Rust's string implementation is my favorite of any language I've used.

What seems like an unfortunate design decision? I'm unclear what concerns you're seeing that would suggest UTF-16 as preferable to UTF-8, either in terms of performance or memory safety.

To not only use UTF-8 as the internal string encoding but practically mandate it, if you want to remain safe.

UTF-8 is a fine transport format, but for raw runtime performance it's obviously going to be an issue if you ever need to iterate over characters, do substring matches, or things like that, because you can't do constant-time "next char" or indexing.

UTF-16 doesn't let you do that either in the presence of combining characters, but they're pretty rare and for many operations it doesn't really matter.

I feel like your comment contains a lot of misunderstanding about UTF-8. For example, UTF-8 is self-synchronizing, which means you can indeed find the "next char" in constant time.
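A minimal sketch of what self-synchronization buys you, assuming a hypothetical helper named `next_boundary` (not a standard library function). Every UTF-8 continuation byte has the bit pattern `10xxxxxx`, and a codepoint has at most three of them, so the loop below runs a bounded number of times no matter how long the string is:

```rust
/// Hypothetical helper: returns the byte offset of the first codepoint
/// boundary strictly after `i` (or `s.len()` if there is none).
fn next_boundary(s: &str, i: usize) -> usize {
    let bytes = s.as_bytes();
    let mut j = i + 1;
    // Continuation bytes look like 10xxxxxx. A codepoint has at most
    // three of them, so this loop does a bounded amount of work.
    while j < bytes.len() && (bytes[j] & 0b1100_0000) == 0b1000_0000 {
        j += 1;
    }
    j.min(bytes.len())
}

fn main() {
    // One codepoint each of 1, 2, 3 and 4 bytes, then one more of 1 byte.
    let s = "a\u{e9}\u{20ac}\u{1F600}z";
    let mut i = 0;
    let mut boundaries = vec![0];
    while i < s.len() {
        i = next_boundary(s, i);
        boundaries.push(i);
    }
    println!("{:?}", boundaries);
}
```

This works even when `i` lands in the middle of a codepoint, which is the point: you never have to scan from the start of the string to resynchronize.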

UTF-8 is certainly not a problem for runtime performance. Substring search, for example, is as straightforward as you might imagine: you have a needle in UTF-8 and a haystack in UTF-8, and a plain application of `memmem` will work just fine. In fact, UTF-8 works out great for performance, because it's very simple to apply fast routines like `memchr`. If you `memchr` for `a`, then because of UTF-8's self-synchronizing property, any and all matches for `a` actually correspond to the codepoint U+0061.
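Here is a rough sketch of the byte-level search idea. Rust's standard library has no `memmem`, so `find_bytes` below is a hypothetical, deliberately naive stand-in; a real implementation would use an optimized routine. The thing to notice is that searching raw bytes gives the same answer as the string-aware `str::find`, and that a search for the byte `a` only ever lands on a real U+0061:

```rust
/// Hypothetical naive stand-in for `memmem`: finds the first byte offset
/// at which `needle` occurs in `haystack`.
fn find_bytes(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        return Some(0); // `windows(0)` would panic, so handle this first
    }
    haystack.windows(needle.len()).position(|w| w == needle)
}

fn main() {
    let haystack = "na\u{ef}ve caf\u{e9} au lait";
    let needle = "caf\u{e9}";

    // Byte-level search agrees with the string-level search.
    let at = find_bytes(haystack.as_bytes(), needle.as_bytes());
    assert_eq!(at, haystack.find(needle));

    // A `memchr`-style scan for the byte b'a' only matches real 'a's,
    // because bytes below 0x80 never occur inside a multi-byte sequence.
    for (i, &b) in haystack.as_bytes().iter().enumerate() {
        if b == b'a' {
            assert!(haystack.is_char_boundary(i));
            assert!(haystack[i..].starts_with('a'));
        }
    }
    println!("found needle at byte offset {:?}", at);
}
```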

Indexing works fine so long as your indices are byte offsets at valid UTF-8 boundaries. Byte offset indexing tends to be useful for mechanical transformations on a string. For example, if you know your `substring` starts at position `i` in `mystr`, then `&mystr[i + substring.len()..]` gives you the slice of `mystr` immediately following your substring in constant time. When all your APIs deal in byte offsets, this turns out to be a perfectly natural thing to do.
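To make the slicing example concrete, here is a small sketch. `rest_after` is a hypothetical helper name; `str::find`, `str::len` and `str::get` are the standard library methods, and all of them deal in byte offsets:

```rust
/// Hypothetical helper: returns the part of `mystr` that follows the
/// first occurrence of `substring`, or `None` if it does not occur.
fn rest_after<'a>(mystr: &'a str, substring: &str) -> Option<&'a str> {
    // `find` returns a byte offset, which is always a valid boundary.
    let i = mystr.find(substring)?;
    // Constant-time slice: no scanning, just pointer arithmetic.
    Some(&mystr[i + substring.len()..])
}

fn main() {
    // Non-ASCII text on both sides of the match causes no trouble.
    assert_eq!(rest_after("cl\u{e9}=v\u{e4}rde", "="), Some("v\u{e4}rde"));
    assert_eq!(rest_after("no separator here", "="), None);

    // Slicing at an offset that is not a boundary panics with `[]`,
    // while `get` reports the problem as `None` instead.
    let s = "\u{e9}"; // one codepoint, two bytes
    assert_eq!(s.get(1..), None);
    assert_eq!(s.get(2..), Some(""));
    println!("ok");
}
```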

Generally speaking, indexing by Unicode codepoint isn't an operation you want to do, because it tends to betray the problem you're trying to solve. For example, if you wanted to display a trimmed string to an end user by "selecting the first 9 characters," then selecting the first 9 codepoints would result in bad things in some circumstances, and it's not just limited to the presence of combining characters. For example, UTF-16 encodes codepoints outside the basic multilingual plane using surrogate pairs, where a surrogate pair consists of two surrogate codepoints that combine to form a single Unicode scalar value (i.e., a non-surrogate codepoint). So if you do the "obvious" thing with UTF-16, you'll wind up with bad results in not-exactly-corner cases.
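As a rough illustration of the "first 9 characters" trap, the sketch below simulates what a UTF-16 based API does if you cut at index 9. `first_utf16_units` is a hypothetical helper; `encode_utf16`, `String::from_utf16` and `String::from_utf16_lossy` are standard library functions:

```rust
/// Hypothetical helper: the first `n` UTF-16 code units of `s`, which is
/// what naive indexing into a UTF-16 string would give you.
fn first_utf16_units(s: &str, n: usize) -> Vec<u16> {
    s.encode_utf16().take(n).collect()
}

fn main() {
    // Eight ASCII characters followed by U+1F600, which lies outside the
    // basic multilingual plane and needs a surrogate pair in UTF-16.
    let s = "12345678\u{1F600}";
    assert_eq!(s.chars().count(), 9); // nine codepoints
    assert_eq!(s.encode_utf16().count(), 10); // but ten UTF-16 code units

    // Cutting after nine code units splits the surrogate pair, leaving a
    // lone high surrogate that is not valid UTF-16 text.
    let cut = first_utf16_units(s, 9);
    assert!(String::from_utf16(&cut).is_err());

    // A lossy decode has to substitute U+FFFD for the broken half.
    let lossy = String::from_utf16_lossy(&cut);
    assert!(lossy.ends_with('\u{FFFD}'));
    println!("{}", lossy);
}
```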

It's worth noting that Rust isn't alone in this. Go represents strings similarly and it also works remarkably well. (The main difference between Go and Rust here is that Rust's string type is guaranteed to contain valid UTF-8, whereas Go's string type is only conventionally UTF-8.) Notably, you won't find "character indexing" anywhere in Go's standard library or various Unicode support libraries. :-)

I would very strongly urge you to read the link in my previous comment to you. I think it would help clear up a lot of misconceptions.

This has been repeated so many times but UTF-16 does not allow constant time indexing either. Combining characters are one case and they are not rare at all. What about surrogates? What about grapheme clusters that are a complicated sequence of emoji and emoji modifiers with ZWJ?
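To put numbers on that, here is a small sketch that counts the same text three ways. `counts` is a hypothetical helper name. Rust's standard library does not segment grapheme clusters, so the "one visible character" part is stated in comments rather than computed:

```rust
/// Hypothetical helper: (UTF-8 bytes, UTF-16 code units, codepoints).
fn counts(s: &str) -> (usize, usize, usize) {
    (s.len(), s.encode_utf16().count(), s.chars().count())
}

fn main() {
    // "e" followed by U+0301 COMBINING ACUTE ACCENT: one visible
    // character, but two codepoints in every encoding.
    let combining = "e\u{301}";
    assert_eq!(counts(combining), (3, 2, 2));

    // Man, ZWJ, woman, ZWJ, girl: typically rendered as a single family
    // emoji. Each emoji needs a surrogate pair in UTF-16.
    let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}";
    assert_eq!(counts(family), (18, 8, 5));

    println!("{:?} {:?}", counts(combining), counts(family));
}
```

None of the three counts is 1 for either string, so neither UTF-8 nor UTF-16 gives you constant-time indexing by what a user would call a character.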

A better suggestion is to rethink why you need those operations in the first place.