Hacker News new | ask | show | jobs
by hackits 3209 days ago
> Go and Rust expose UTF-8 at the byte level.

Or you can take the C++/C approach and have a character 1 byte, 2 bytes, or a multi-byte. It's a pain in the ass to constantly in C/C++ having to interface between two libraries that one decided to use char and another w_char!

1 comments

The way the C and C++ committees approach Unicode is even worse than Python breaking away from UTF-16 in the wrong direction (UTF-32 being the wrong direction and UTF-8 being the right direction).

The first rule of reasonably happy C and C++ Unicode programming is not to use wchar_t for any purpose other than immediate interaction with the Win32 API.

The second rule of reasonably happy C and C++ Unicode programming is not to use the standard library facilities (which depend on the execution environment) for text processing but using some other library where the UTF-* interpretation of inputs and outputs doesn't shift depending on the execution environment or compilation environment.

This is where we run into a little bit of a problem. You have a char pointer that can be either a multiple byte encoded (depending on the code page window is using). It also can be UTF-8 encoded. Then when you move onto windows wchar_t that is originally defined as (UCS-2) then was later renamed to UTF-16, due to surrogate pair's.

So in the windows world with COM/DCOM you're basically nugged into using UTF-16 wchar_t or it becomes a hell of a lot of pain. So it is easier just simply to accept to use UTF-16 and do all the conversion from UTF-8, UTF-32, code pages to a single encoding standard.

You could just wrap that pointer in a class that describes what it is - ideally at the type level (Utf8String, etc). Each string class knows how to convert from other string types, and any library calls get wrapped in a method that is either templated on the string input type(s) or takes a BaseString* and calls virtual conversion functions. Or force a manual call to convert each time so that your fellow developers know when slow conversions are happening for sure.

It is a crappy situation though. Pick where you want your pain point to be.

See http://utf8everywhere.org

It talks about Windows quite a bit.