by hsivonen 3213 days ago
> Go and Rust expose UTF-8 at the byte level. This is something of a headache and may result in invalid string slices. It basically punts the problem back to the user.

In Go, yes. In Rust, no. UTF-8 in Go is garbage in, garbage out. Rust, however, won't let you materialize an invalid &str without "unsafe".
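
To illustrate the Rust half of that claim, here is a minimal sketch using only the standard library. The helper name `checked_view` is made up for this example; `str::from_utf8` and `str::get` are the real standard-library calls doing the checking.

```rust
use std::str;

/// Hypothetical helper: view `bytes` as a `&str`, or `None` if they are
/// not valid UTF-8. The validation itself is done by `str::from_utf8`.
fn checked_view(bytes: &[u8]) -> Option<&str> {
    str::from_utf8(bytes).ok()
}

fn main() {
    // 'é' (U+00E9) takes two bytes in UTF-8: 0xC3 0xA9.
    let text = "h\u{E9}llo";
    let bytes = text.as_bytes();

    // Cutting through the middle of 'é' leaves bytes that are not valid UTF-8.
    assert!(checked_view(&bytes[..2]).is_none());
    // Cutting on a character boundary is fine.
    assert_eq!(checked_view(&bytes[..3]), Some("h\u{E9}"));

    // Slicing the &str itself is checked at run time: `get` returns None
    // (and `&text[..2]` would panic) instead of yielding an invalid &str.
    assert!(text.get(..2).is_none());
    assert_eq!(text.get(..3), Some("h\u{E9}"));
}
```

The only way around these checks is an `unsafe` call such as `str::from_utf8_unchecked`.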


The difference is that Go expects your majority use case to be copying or concatenating. If you take a string value, you are normally not going to change it, or you are going to combine it with something else. If you have valid UTF-8 input, you should get valid output, though it might not be 'normalized' to a single form. If you care about normalizing, you can decide when to do that (usually in output construction).

If you need to make a decision based on the content of a string, then you often need to make a normalized copy of the inputs (normalized the same way for both).

Most importantly, if you feed in garbage, you get out the SAME garbage. The real world, and historical data, are messy. Trying to be smart can often lead to the most disastrous consequences. Being conservative and tolerant allows for intentional planning to handle the conversion at the source, if and when desired.

> Go and Rust expose UTF-8 at the byte level.

Or you can take the C/C++ approach and have a character be 1 byte, 2 bytes, or multi-byte. It's a pain in the ass in C/C++ to constantly have to interface between two libraries where one decided to use char and the other wchar_t!

The way the C and C++ committees approach Unicode is even worse than Python breaking away from UTF-16 in the wrong direction (UTF-32 being the wrong direction and UTF-8 being the right direction).

The first rule of reasonably happy C and C++ Unicode programming is not to use wchar_t for any purpose other than immediate interaction with the Win32 API.

The second rule of reasonably happy C and C++ Unicode programming is not to use the standard library facilities (which depend on the execution environment) for text processing, but to use some other library where the UTF-* interpretation of inputs and outputs doesn't shift depending on the execution environment or compilation environment.

This is where we run into a bit of a problem. You have a char pointer that can be multi-byte encoded (depending on the code page Windows is using). It can also be UTF-8 encoded. Then you move on to the Windows wchar_t, which was originally defined as UCS-2 and was later redefined as UTF-16 to accommodate surrogate pairs.

So in the Windows world with COM/DCOM you're basically nudged into using UTF-16 wchar_t, or it becomes a hell of a lot of pain. It is easier to simply accept UTF-16 and convert everything (UTF-8, UTF-32, code pages) to that single encoding.
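
As a rough illustration of that conversion step and of why surrogate pairs matter, here is a sketch in Rust rather than C++. The function names `to_utf16` and `from_utf16` are made up for this example; they only wrap the standard library's `str::encode_utf16` and `String::from_utf16`. No actual Win32 or COM call is shown.

```rust
/// Hypothetical helper: convert UTF-8 text to UTF-16 code units, as one
/// might do right before handing text to a wide-character API.
fn to_utf16(s: &str) -> Vec<u16> {
    s.encode_utf16().collect()
}

/// Hypothetical helper: convert UTF-16 code units back to a String.
/// Returns None if the input contains an unpaired surrogate.
fn from_utf16(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}

fn main() {
    // U+1F600 lies outside the Basic Multilingual Plane, so UTF-16 needs a
    // surrogate pair for it; plain UCS-2 could not represent it at all.
    let wide = to_utf16("a\u{1F600}");
    assert_eq!(wide, vec![0x0061, 0xD83D, 0xDE00]);

    // The round trip restores the original text.
    assert_eq!(from_utf16(&wide).as_deref(), Some("a\u{1F600}"));

    // A lone high surrogate is not valid UTF-16.
    assert!(from_utf16(&[0xD83D]).is_none());
}
```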

You could just wrap that pointer in a class that describes what it is - ideally at the type level (Utf8String, etc). Each string class knows how to convert from other string types, and any library calls get wrapped in a method that is either templated on the string input type(s) or takes a BaseString* and calls virtual conversion functions. Or force a manual call to convert each time so that your fellow developers know when slow conversions are happening for sure.
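
A minimal sketch of the "force a manual call" variant of that idea, written in Rust instead of C++ to keep it short. `Utf8String` matches the name suggested above; `Utf16String`, its methods, and `wide_api_len` are hypothetical stand-ins, not any real library's API.

```rust
/// Hypothetical wrapper for text known to be UTF-8.
#[derive(Debug, Clone, PartialEq)]
struct Utf8String(String);

/// Hypothetical wrapper for text held as UTF-16 code units.
#[derive(Debug, Clone, PartialEq)]
struct Utf16String(Vec<u16>);

impl Utf8String {
    /// Explicit conversion, so the cost is visible at the call site.
    fn to_utf16(&self) -> Utf16String {
        Utf16String(self.0.encode_utf16().collect())
    }
}

impl Utf16String {
    /// Explicit conversion back; None if the code units are not valid UTF-16.
    fn to_utf8(&self) -> Option<Utf8String> {
        String::from_utf16(&self.0).ok().map(Utf8String)
    }
}

/// Stand-in for a wrapped library call that only accepts UTF-16.
/// Here it just reports the number of code units.
fn wide_api_len(text: &Utf16String) -> usize {
    text.0.len()
}

fn main() {
    let name = Utf8String("na\u{EF}ve".to_string());

    // `wide_api_len(&name)` would not compile: the encodings are distinct
    // types, so the conversion has to be written out.
    let wide = name.to_utf16();
    assert_eq!(wide_api_len(&wide), 5);
    assert_eq!(wide.to_utf8(), Some(name));
}
```

The trade-off is the one described above: the explicit call is noisier, but nobody converts by accident.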

It is a crappy situation though. Pick where you want your pain point to be.

See http://utf8everywhere.org

It talks about Windows quite a bit.