by hackits 3214 days ago
This is where we run into a bit of a problem. A char pointer can hold multibyte-encoded text (depending on the code page Windows is using), or it can hold UTF-8. Then you move on to the Windows wchar_t, which was originally defined as UCS-2 and later redefined as UTF-16 once surrogate pairs were needed.

So in the Windows world with COM/DCOM you're basically nudged into using UTF-16 wchar_t, or it becomes a hell of a lot of pain. It is easier to simply accept UTF-16 and convert everything else (UTF-8, UTF-32, code pages) to that single encoding.
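To make the "convert everything to one encoding" step concrete, here is a rough, portable sketch of UTF-8 to UTF-16 conversion, including the surrogate pairs that turned UCS-2 into UTF-16. The function name utf8_to_utf16 is made up for illustration and is not a Windows API; on Windows you would more likely call MultiByteToWideChar. It assumes reasonably well-formed input and does not reject overlong encodings.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 bytes and re-encode them as UTF-16.
// Simplification: overlong UTF-8 encodings are not detected.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t len = 0;
        // The lead byte tells us how long the sequence is.
        if (b0 < 0x80)                { cp = b0;        len = 1; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte contributes six more bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }
        i += len;

        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("code point not encodable");

        if (cp < 0x10000) {
            // Fits in one 16-bit unit (this is all UCS-2 could represent).
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Needs a surrogate pair: high unit, then low unit.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

Anything outside the Basic Multilingual Plane comes out as two 16-bit units, which is exactly why code that treats one wchar_t as one character breaks.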

2 comments

You could just wrap that pointer in a class that describes what it is - ideally at the type level (Utf8String, etc.). Each string class knows how to convert from the other string types, and any library call gets wrapped in a method that is either templated on the string input type(s) or takes a BaseString* and calls virtual conversion functions. Or force a manual call to convert each time, so that your fellow developers know for sure when slow conversions are happening.

It is a crappy situation though. Pick where you want your pain point to be.
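A minimal sketch of the templated variant of that idea, assuming a library that only accepts UTF-8. Utf8String is the name from the comment above; Utf16String, from, lib_byte_length and byte_length are all hypothetical names invented for this example. Unpaired surrogates are replaced with U+FFFD, which is one choice among several.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <utility>

// The encoding lives in the type, so a raw pointer is never ambiguous.
struct Utf16String {
    std::u16string units;
};

class Utf8String {
public:
    explicit Utf8String(std::string bytes) : bytes_(std::move(bytes)) {}

    // Named conversion, so the transcoding cost is visible at the call site.
    static Utf8String from(const Utf16String& s) {
        std::string out;
        const std::u16string& u = s.units;
        for (std::size_t i = 0; i < u.size(); ++i) {
            std::uint32_t cp = u[i];
            if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < u.size() &&
                u[i + 1] >= 0xDC00 && u[i + 1] <= 0xDFFF) {
                // Combine a high/low surrogate pair into one code point.
                cp = 0x10000 + ((cp - 0xD800) << 10) + (u[i + 1] - 0xDC00);
                ++i;
            } else if (cp >= 0xD800 && cp <= 0xDFFF) {
                cp = 0xFFFD;  // unpaired surrogate: substitute U+FFFD
            }
            append(out, cp);
        }
        return Utf8String(std::move(out));
    }

    // Already UTF-8: a plain copy, no transcoding.
    static Utf8String from(const Utf8String& s) { return s; }

    const char* c_str() const { return bytes_.c_str(); }
    const std::string& bytes() const { return bytes_; }

private:
    // Encode one code point as 1 to 4 UTF-8 bytes.
    static void append(std::string& out, std::uint32_t cp) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }

    std::string bytes_;
};

// Stand-in for a library call that only accepts UTF-8 (hypothetical).
inline std::size_t lib_byte_length(const char* utf8) {
    return std::strlen(utf8);
}

// Wrapper templated on the input string type; transcodes only when needed.
template <class S>
std::size_t byte_length(const S& s) {
    return lib_byte_length(Utf8String::from(s).c_str());
}
```

The trade-off is the one described above: the template hides the conversion from the caller, which is convenient but makes the slow path invisible. Dropping the wrapper and requiring callers to write Utf8String::from(...) themselves is the "force a manual call" option.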

See http://utf8everywhere.org

It talks about Windows quite a bit.