by Skalman 4001 days ago
It's an encoding that isn't good at anything: it's neither ASCII-compatible (like UTF-8), nor fixed-length (like UTF-32), but because most characters require only 2 bytes, developers frequently assume that none require more, leading to bugs when a character eventually is represented by 4 bytes.
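To make the "2 bytes is not always enough" failure mode concrete, here is a minimal sketch using only the Rust standard library. The helper names `utf16_units` and `needs_surrogate_pair` are made up for illustration; `encode_utf16` and `len_utf16` are the std methods doing the work.

```rust
/// Number of 16-bit code units needed to encode `s` as UTF-16.
fn utf16_units(s: &str) -> usize {
    s.encode_utf16().count()
}

/// True if `c` lies outside the Basic Multilingual Plane and therefore
/// needs a surrogate pair (two code units, i.e. 4 bytes) in UTF-16.
fn needs_surrogate_pair(c: char) -> bool {
    c.len_utf16() == 2
}
```

Code that assumes one code unit per character works for "a" and silently breaks for something like U+1D11E (musical G clef), where `utf16_units` reports 2.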
3 comments

> fixed-length (like UTF-32),

UTF-32 is only fixed-length if you don't care about diacritics, variation selectors, RTL languages, and so on. Unicode is not one code point or one char/wchar/uint32 per glyph.

You've changed topic from code points to grapheme clusters. Rust's character/string support is strictly for code points (the documentation is fairly clear about the distinction).

Few string libraries actually deal with grapheme clusters as the native underlying representation (Swift being a notable exception).

The broader point I'm making is that Unicode is hard, and attempts to simplify it by choosing a different encoding (i.e. switching to UTF-32 to save yourself from all problems) are a bit misguided.
My life is a lie.
UTF-32 is not good for anything either: easy access to code points is just as useless as access to UTF-8 bytes. Any meaningful operation on text (even counting the number of characters) requires parsing grapheme clusters, which have variable length regardless of which encoding is used.
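A small sketch of the point that no encoding gives you "one unit per visible character". The function name `lengths` is hypothetical; it only uses std. Note that the standard library has no grapheme cluster segmentation, so this shows the mismatch rather than solving it. (Strictly speaking, Rust's `char` is a Unicode scalar value, i.e. a code point that is not a surrogate.)

```rust
/// Returns (UTF-8 bytes, UTF-16 code units, Unicode scalar values) for `s`.
/// The last number is also the length in UTF-32 code units.
fn lengths(s: &str) -> (usize, usize, usize) {
    (s.len(), s.encode_utf16().count(), s.chars().count())
}
```

For "e" followed by U+0301 (combining acute accent), which renders as a single "é", this gives (3, 2, 2): even the UTF-32 length is 2. The precomposed U+00E9 gives (2, 1, 1) for the same visible glyph.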
I don't know much about Rust and its standard library, so I have a question: what if I want to develop Windows-only software in Rust? Will I need to convert back and forth between UTF-16 and UTF-8 (or whatever Rust uses in other parts of the library)?
Yes.

The Rust std library had to pick a string encoding, and it picked UTF-8 (which is really the best Unicode encoding). The String type is platform neutral and always UTF-8.
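For what the round trip looks like, a minimal sketch with std only. The wrapper names `to_utf16` and `from_utf16` are just for illustration; `str::encode_utf16` and `String::from_utf16` are the actual std calls.

```rust
use std::string::FromUtf16Error;

/// Encodes a UTF-8 string slice as a vector of UTF-16 code units.
fn to_utf16(s: &str) -> Vec<u16> {
    s.encode_utf16().collect()
}

/// Decodes UTF-16 code units into a UTF-8 `String`.
/// Fails if the input contains an unpaired surrogate.
fn from_utf16(units: &[u16]) -> Result<String, FromUtf16Error> {
    String::from_utf16(units)
}
```

The decode direction is fallible because arbitrary `u16` sequences (which is what Windows APIs can hand you) are not guaranteed to be well-formed UTF-16. There is also `String::from_utf16_lossy` if replacing bad units with U+FFFD is acceptable.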

However, it does provide an OsString type, which on Windows is UTF-16. Maybe there is a library (and if not, one could be written) targeting Windows only and implementing stronger UTF-16 string processing on the OsString type.

EDIT: To be clear, Rust's trait system makes this very easy to do. You just define all the methods you want OsString to have in a trait WindowsString, and implement it for OsString, even though OsString is a std library type. One of the great things about Rust is that it's trivial to use the std library as a shared "pivot" which various third-party libraries extend according to your use case.
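A rough sketch of that extension-trait pattern. The trait name `WindowsString` comes from the comment above; the method `utf16_len` is an invented example, and the body goes through `&str` so that it compiles on any platform. On Windows, a real implementation would more likely use `std::os::windows::ffi::OsStrExt::encode_wide`.

```rust
use std::ffi::OsString;

/// Hypothetical extension trait adding UTF-16 oriented helpers.
trait WindowsString {
    /// Number of UTF-16 code units, or `None` if the string
    /// is not valid Unicode.
    fn utf16_len(&self) -> Option<usize>;
}

// Implementing a local trait for a std type is allowed by the orphan rules.
impl WindowsString for OsString {
    fn utf16_len(&self) -> Option<usize> {
        // Portable stand-in (assumption): convert via &str.
        self.to_str().map(|s| s.encode_utf16().count())
    }
}
```

Any code that has the trait in scope can then call `os_string.utf16_len()` as if it were a built-in method.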

I believe Rust uses WTF-8 as an intermediate format for windowsy things (cheaper), but I'm not sure.
What is... oh... UTF-16, the gift that keeps on giving... this is, at the same time, utterly hilarious and horribly depressing:

https://simonsapin.github.io/wtf-8/

But there is actually prior art here - Java's contribution to perverse Unicode encodings is called "Modified UTF-8" and encodes every UTF-16 surrogate code unit separately.

http://docs.oracle.com/javase/6/docs/api/java/io/DataInput.h...
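To illustrate what "encodes every surrogate code unit separately" means at the byte level, here is a sketch. The function names are made up, and this covers only the surrogate handling: Modified UTF-8 also encodes NUL specially, which is not handled here.

```rust
/// Encodes one 16-bit code unit using the 3-byte UTF-8 bit pattern
/// (1110xxxx 10xxxxxx 10xxxxxx). Only meaningful for values >= 0x800.
fn three_byte(unit: u16) -> [u8; 3] {
    [
        0xE0 | (unit >> 12) as u8,
        0x80 | ((unit >> 6) & 0x3F) as u8,
        0x80 | (unit & 0x3F) as u8,
    ]
}

/// Encodes `c` so that a supplementary-plane character becomes two
/// 3-byte sequences (one per UTF-16 surrogate) instead of the single
/// 4-byte sequence standard UTF-8 uses. BMP characters are encoded
/// as ordinary UTF-8.
fn encode_surrogates_separately(c: char) -> Vec<u8> {
    let mut buf = [0u16; 2];
    let units = c.encode_utf16(&mut buf);
    if units.len() == 2 {
        units.iter().flat_map(|&u| three_byte(u)).collect()
    } else {
        c.to_string().into_bytes()
    }
}
```

For U+1D11E this yields six bytes (ED A0 B4 ED B4 9E), which a strict UTF-8 decoder such as `String::from_utf8` rejects, since encoded surrogates are not valid UTF-8.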

We have an OsString type (http://doc.rust-lang.org/stable/std/ffi/struct.OsString.html) to abstract over a native string in whatever encoding your platform has. Generally, things that interact with the OS use these, and they can convert to a UTF-8 String.
Suppose I'm on Linux, but I want to interact with Windows stuff. (CIFS protocol, NTFS on-disk format, disassembler for Windows executables, Wine-like program, cross-compiler, etc.)

I'll be wanting UTF-16 support. Going the other way matters too; if I'm on Windows I may need UTF-32 (UCS-4) support.

Sure. That's not a problem. You can write any kind of string type you want, as a library, and convert between them. One of the nice things about Rust is that it's low-level enough that almost everything is a library anyway, so the language won't get in your way if you need SomeNicheString.
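As a rough idea of what such a library type could look like, here is a sketch of a UTF-32 string as a newtype. `Utf32String` and `to_utf8` are hypothetical names, not anything from std or an existing crate.

```rust
/// Hypothetical library type: text stored as one `u32` per
/// Unicode scalar value (UTF-32).
#[derive(Debug, Clone, PartialEq, Eq)]
struct Utf32String(Vec<u32>);

impl From<&str> for Utf32String {
    fn from(s: &str) -> Self {
        // Every `char` is a valid scalar value, so this cannot fail.
        Utf32String(s.chars().map(|c| c as u32).collect())
    }
}

impl Utf32String {
    /// Converts back to a UTF-8 `String`. Returns `None` if any unit
    /// is not a valid scalar value (e.g. a surrogate, or > 0x10FFFF).
    fn to_utf8(&self) -> Option<String> {
        self.0.iter().map(|&u| char::from_u32(u)).collect()
    }
}
```

The conversion into the type is infallible, while the conversion back is not, because a raw `Vec<u32>` coming from elsewhere carries no validity guarantee.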
Since the full bullet point was “UTF-16 or UCS-2 support anywhere outside windows API compatibility routines” I'm assuming you'd get UTF-8 out of any high-level interface.