| HN Mirror



	by Skalman 4001 days ago
	It's an encoding that isn't good at anything: it's neither ASCII-compatible (like UTF-8), nor fixed-length (like UTF-32), but because most characters require only 2 bytes, developers frequently assume that none require more, leading to bugs when a character eventually is represented by 4 bytes.
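
To make the 2-byte assumption concrete, here is a minimal sketch using only the Rust standard library. The helper names `utf16_units` and `scalar_count` are invented for illustration; they are not part of any library.

```rust
/// Number of 16-bit code units needed to encode `s` as UTF-16.
fn utf16_units(s: &str) -> usize {
    s.encode_utf16().count()
}

/// Number of Unicode scalar values (code points) in `s`.
fn scalar_count(s: &str) -> usize {
    s.chars().count()
}
```

For "A" both functions return 1. For U+1D11E (musical G clef), `scalar_count` returns 1 but `utf16_units` returns 2, because the character is stored as a surrogate pair (4 bytes). Code that indexes UTF-16 by unit and assumes one unit per character splits that pair.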

3 comments

asveikau 4001 days ago

> fixed-length (like UTF-32),

UTF-32 is only fixed length if you don't care about diacritics, variation selectors, RTL languages, and others. Unicode is not one code point or one char/wchar/uint32 per glyph.
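
A small sketch of the diacritics case, assuming nothing beyond the Rust standard library. The helper `utf32_units` is a hypothetical name; it simply lists the 32-bit code points a UTF-32 encoding would store.

```rust
/// The sequence of 32-bit values a UTF-32 encoding of `s` would contain.
fn utf32_units(s: &str) -> Vec<u32> {
    s.chars().map(|c| c as u32).collect()
}
```

The precomposed "é" (U+00E9) yields one unit, while the visually identical "e" followed by a combining acute accent (U+0065 U+0301) yields two. Both render as a single glyph, so even in UTF-32 one unit does not correspond to one user-perceived character.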


gilgoomesh 4001 days ago

You've changed topic from code points to grapheme clusters. Rust's character/string support is strictly for code points (the documentation is fairly clear about the distinction).

Few string libraries actually deal with grapheme clusters as the native underlying representation (Swift being a notable exception).


asveikau 4001 days ago

The broader point I'm making is that Unicode is hard, and attempts to simplify it by choosing a different encoding (i.e. switching to UTF-32 to save yourself from all problems) are a bit misguided.


sidarape 4001 days ago

My life is a lie.


hamstergene 4001 days ago

UTF-32 is not good for anything either: easy access to code points is just as useless as access to UTF-8 bytes. Any meaningful operation on text (even counting the number of characters) requires parsing grapheme clusters, which have variable length regardless of what encoding is used.


yurish 4001 days ago

I don't know much about Rust and its standard library, so I have a question: what if I want to develop Windows-only software in Rust? Will I need to convert back and forth between UTF-16 and UTF-8 (or whatever Rust uses in other parts of the library)?


tatterdemalion 4001 days ago

Yes.

The Rust std library had to pick a string encoding, and it picked UTF-8 (which is really the best Unicode encoding). The String type is platform neutral and always UTF-8.

However, it does provide an OsString type, which on Windows is UTF-16. Maybe there is a library (and if not, one could be written) that targets Windows only and implements stronger UTF-16 string processing on the OsString type.

EDIT: To be clear, Rust's trait system makes this very easy to do. You just define all the methods you want OsString to have in a trait WindowsString, and implement it for OsString, even though OsString is a std library type. One of the great things about Rust is that it's trivial to use the std library as a shared "pivot" which various third-party libraries extend according to your use case.
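
The extension-trait idea described above could look roughly like this. The trait name `WindowsString` comes from the comment; the method `utf16_len` is invented for illustration, and the sketch is kept platform-neutral by going through `to_str` rather than the Windows-only `std::os::windows::ffi::OsStrExt::encode_wide`.

```rust
use std::ffi::OsString;

/// Hypothetical extension trait adding UTF-16 helpers to a std type.
trait WindowsString {
    /// UTF-16 code units needed for this string,
    /// or `None` if it is not valid Unicode.
    fn utf16_len(&self) -> Option<usize>;
}

// Implementing a local trait for a foreign type is allowed by the
// coherence rules, so no wrapper type is needed.
impl WindowsString for OsString {
    fn utf16_len(&self) -> Option<usize> {
        self.to_str().map(|s| s.encode_utf16().count())
    }
}
```

With the trait in scope, `OsString::from("héllo").utf16_len()` gives `Some(5)`. A Windows-only crate would more likely build on `encode_wide`, which also handles strings that are not valid Unicode.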


Manishearth 4001 days ago

I believe Rust uses WTF-8 as an intermediate format for windowsy things (cheaper), but I'm not sure.


the_why_of_y 4001 days ago

What is... oh... UTF-16, the gift that keeps on giving... this is, at the same time, utterly hilarious and horribly depressing:

https://simonsapin.github.io/wtf-8/

But there is actually prior art here - Java's contribution to perverse Unicode encodings is called "Modified UTF-8" and encodes every UTF-16 surrogate code unit separately.

http://docs.oracle.com/javase/6/docs/api/java/io/DataInput.h...
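
The problem WTF-8 addresses is that real-world "UTF-16" can contain unpaired surrogates, which strict UTF-8 cannot represent. A minimal sketch with the standard library shows the two usual choices when decoding such data into a `String`; the wrapper names `decode_strict` and `decode_lossy` are hypothetical.

```rust
/// Strict decoding: returns `None` if `units` contains an unpaired surrogate.
fn decode_strict(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}

/// Lossy decoding: each unpaired surrogate becomes U+FFFD.
fn decode_lossy(units: &[u16]) -> String {
    String::from_utf16_lossy(units)
}
```

For the input `[0x0041, 0xD800]` ("A" followed by a lone high surrogate), `decode_strict` returns `None` and `decode_lossy` returns "A\u{FFFD}". Neither round-trips the original data, which is the gap a surrogate-preserving encoding fills.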


steveklabnik 4001 days ago

We have an OsString type (http://doc.rust-lang.org/stable/std/ffi/struct.OsString.html) to abstract over a native string in whatever encoding your platform has. Generally, things that interact with the OS use these, and they can convert to a UTF-8 String.
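
A short sketch of that conversion, using the standard `OsString::into_string` and `OsStr::to_string_lossy` methods. The wrapper function names are made up for the example.

```rust
use std::ffi::OsString;

/// Strict conversion: `None` when the OS string is not valid Unicode.
fn os_to_string(os: OsString) -> Option<String> {
    os.into_string().ok()
}

/// Lossy conversion: invalid sequences are replaced with U+FFFD.
fn os_to_string_lossy(os: &OsString) -> String {
    os.to_string_lossy().into_owned()
}
```

The strict form is fallible because a native string (a file name, for instance) is not guaranteed to be valid Unicode on any platform.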


milspec 4001 days ago

Suppose I'm on Linux, but I want to interact with Windows stuff. (CIFS protocol, NTFS on-disk format, disassembler for Windows executables, Wine-like program, cross-compiler, etc.)

I'll be wanting UTF-16 support. Going the other way matters too; if I'm on Windows I may need UTF-32 support.
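
For what it's worth, the UTF-16 conversions in the standard library are not tied to the host platform. Below is a minimal sketch of handling little-endian UTF-16 bytes, as found in many Windows formats, on any OS. The function names are hypothetical, and little-endian byte order is an assumption about the input.

```rust
/// Decode little-endian UTF-16 bytes into a `String`.
/// Returns `None` on odd length or unpaired surrogates.
fn decode_utf16le(bytes: &[u8]) -> Option<String> {
    if bytes.len() % 2 != 0 {
        return None;
    }
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect();
    String::from_utf16(&units).ok()
}

/// Encode a `&str` as little-endian UTF-16 bytes.
fn encode_utf16le(s: &str) -> Vec<u8> {
    s.encode_utf16().flat_map(|unit| unit.to_le_bytes()).collect()
}
```

`encode_utf16le("Hi")` produces the bytes `48 00 69 00`, and `decode_utf16le` turns them back into "Hi".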


steveklabnik 4000 days ago

Sure. That's not a problem. You can write any kind of string type you want, as a library, and convert between them. One of the nice things about Rust is that it's low-level enough that almost everything is a library anyway, so the language won't get in your way if you need SomeNicheString.
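
As a rough illustration of a string type written as a library, here is a hypothetical fixed-width type that stores one `char` per element. The name `Utf32String` and its API are invented for this sketch and do not refer to an existing crate.

```rust
/// Hypothetical fixed-width string: one Unicode scalar value per element.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Utf32String(Vec<char>);

impl From<&str> for Utf32String {
    fn from(s: &str) -> Self {
        Utf32String(s.chars().collect())
    }
}

impl From<Utf32String> for String {
    fn from(s: Utf32String) -> Self {
        s.0.into_iter().collect()
    }
}

impl Utf32String {
    /// Constant-time access to the n-th code point.
    fn get(&self, n: usize) -> Option<char> {
        self.0.get(n).copied()
    }
}
```

Conversion to and from the standard `String` goes through the ordinary `From` trait, so the type composes with the rest of the standard library. Indexing is by code point, not by grapheme cluster.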


acdha 4001 days ago

Since the full bullet point was “UTF-16 or UCS-2 support anywhere outside windows API compatibility routines” I'm assuming you'd get UTF-8 out of any high-level interface.
