Hacker News
by gecko 1612 days ago
Generically, anything by Microsoft will historically have used UCS-2, and will use UTF-16 these days, so this is utterly unsurprising to me as an experienced Windows dev. Conversely, Linux (and POSIX, more generally) deciding that filename encoding is a per-filename and untracked thing is a bit lit, from my perspective. Point being: when it comes to handling Unicode and foreign characters, just, like... always read the documentation. Assume nothing.

Yep. Microsoft is the main reason for two of my favourite Unicode links: https://utf8everywhere.org/ and https://simonsapin.github.io/wtf-8/

And, apparently it's mildly inaccurate to say it uses UTF-16... it's more like UCS-2 with UTF-16 hacked in, with no validation. Thus WTF-8.

No, it's UTF-16 with no validation at the kernel level. Invalid UTF-16 is also invalid UCS-2 as those code points were explicitly barred from use.
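To make "no validation" concrete, here is a minimal Python sketch of the check that is being skipped: every high surrogate must be immediately followed by a low surrogate, and a low surrogate may not appear on its own. The function name is hypothetical and this is not a Windows API, just an illustration of the rule.

```python
def is_well_formed_utf16(units):
    """Return True if a sequence of 16-bit code units is well-formed UTF-16.

    Hypothetical helper for illustration only; not part of any Windows API.
    """
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 >= n or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                return False
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            # A low surrogate with no preceding high surrogate is invalid.
            return False
        else:
            # Any other code unit is a complete BMP code point.
            i += 1
    return True
```

A filename whose code units fail this check is the kind of "broken name" being discussed: storable, but not valid UTF-16.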

In practice, only malware will create such broken names. High-level software (e.g. Microsoft's own VSCode) will not handle broken UTF-16. And indeed the built-in UTF-8 code page will lossily convert UTF-16 (unpaired surrogates are replaced with the Unicode replacement character).
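A rough illustration of that lossy replacement, using Python's codecs as a stand-in. This is an assumption for demonstration purposes; it does not go through the Windows code page machinery, it only shows the same kind of substitution.

```python
# 'a', a lone high surrogate (U+D800), then 'b', as UTF-16-LE bytes.
raw = b"a\x00\x00\xd8b\x00"

# Strict decoding rejects the unpaired surrogate outright.
try:
    raw.decode("utf-16-le")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Lossy decoding swaps the unpaired surrogate for U+FFFD.
lossy = raw.decode("utf-16-le", errors="replace")

# The result is valid UTF-8, but the original code unit is gone for good.
utf8 = lossy.encode("utf-8")
```

After the replacement there is no way to recover which surrogate was there, so two different broken names can collapse to the same UTF-8 string.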

Hm. Are you sure? Because the utf8everywhere article (and various Microsoft-related framework discussions) seem to suggest there's no validation anywhere. You can easily create partial code points, and just hitting backspace in a text field can do it. That seems to imply there's no UTF-16 validation even at a higher level.

But I will readily defer to your expertise on this. I've not coded in Microsoft land for like 18 years. MFC was my last experience in this, where I still have this vague memory of being shocked by an API returning an int32 and instructing you to cast it to a void pointer (overloaded response message). No wonder they had issues with 64-bit migration at the time.

Edit: cite on the utf8everywhere thing. "in plain Windows edit control (until Vista), it takes two backspaces to delete a character which takes 4 bytes in UTF-16. On Windows 7, the console displays such characters as two invalid characters, regardless of the font being used."
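For anyone wondering why it takes two backspaces: a character outside the BMP is two 16-bit code units in UTF-16, so a control that deletes one code unit per keypress leaves half a surrogate pair behind. A small Python sketch of that, where the "naive backspace" is simply an assumption modelled as dropping the last two bytes:

```python
# U+1F600 is outside the BMP, so UTF-16 needs a surrogate pair for it.
s = "\U0001F600"
units = s.encode("utf-16-le")  # 4 bytes: two 16-bit code units

# Naive backspace (assumed behaviour): remove one code unit, i.e. 2 bytes.
after_backspace = units[:-2]

# What is left is a single code unit; read it as a little-endian integer.
leftover = int.from_bytes(after_backspace, "little")

# It falls in the high-surrogate range, so the buffer is now invalid UTF-16.
is_lone_high_surrogate = 0xD800 <= leftover <= 0xDBFF
```

A second backspace removes the remaining half, which matches the behaviour described in the quote.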

Maybe they've improved since though. But surely there's a lot of that baggage in the libraries.

I mean, Vista is ~15 years old at this point. If anything that's still part of Windows makes the backspace mistake then if nothing else it's impressive it's survived this long without being noticed.
Windows Vista came out fifteen years ago.