by StefanKarpinski 1285 days ago
On UNIX, paths are UTF-8 by convention, but not forced to be valid. Treating paths as UTF-8 works very well as long as you haven't also made the mistake of requiring your UTF-8 strings to be valid (which Python did, unfortunately).

On Windows, paths are UTF-16 by convention, but also not forced to be valid. However, invalid UTF-16 can be faithfully converted to WTF-8 and converted back losslessly, so you can translate Windows paths to WTF-8 and everything Just Works™ [1].

There aren't any operating systems I'm aware of where paths are actually Shift-JIS by convention, so that seems like a non-issue. Using "UTF-8 by convention" strings works on all modern OSes.

[1] OK, here's why the WTF-8 thing works so well. If we write WTF-16 for potentially invalid UTF-16 (just arbitrary sequences of 16-bit code units), then the mapping between WTF-16 and WTF-8 space is a bijection because it's losslessly round-trippable. But more importantly, this WTF-8/16 bijection is also a homomorphism with respect to pretty much any string operation you can think of. For example `wtf8(wtf16_concat(a, b)) == wtf8_concat(wtf8(a), wtf8(b))` for arbitrary WTF-16 strings a and b. Similar identities hold for other string operations like searching for substrings or splitting on specific strings.
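
To make the round trip concrete, here is a minimal sketch in Python. It leans on the assumption that Python's `surrogatepass` error handler, applied to the `utf-16-le` and `utf-8` codecs, behaves like a WTF-16/WTF-8 converter for well-formed input; the function names `wtf16_to_wtf8` and `wtf8_to_wtf16` are made up for illustration and are not from any library.

```python
def wtf16_to_wtf8(units: bytes) -> bytes:
    """Convert possibly ill-formed UTF-16-LE bytes to WTF-8.

    Proper surrogate pairs become ordinary 4-byte UTF-8 sequences;
    lone surrogates pass through as 3-byte sequences.
    """
    text = units.decode("utf-16-le", errors="surrogatepass")
    return text.encode("utf-8", errors="surrogatepass")


def wtf8_to_wtf16(data: bytes) -> bytes:
    """Inverse of wtf16_to_wtf8, assuming the input is well-formed WTF-8."""
    text = data.decode("utf-8", errors="surrogatepass")
    return text.encode("utf-16-le", errors="surrogatepass")


# A Windows-style name with a lone high surrogate (U+D800) in the middle.
name16 = "C:\\tmp\\".encode("utf-16-le") + b"\x00\xd8" + "x".encode("utf-16-le")
name8 = wtf16_to_wtf8(name16)

# The lone surrogate survives as the three bytes ED A0 80, and the
# conversion back reproduces the original code units exactly.
assert name8 == b"C:\\tmp\\\xed\xa0\x80x"
assert wtf8_to_wtf16(name8) == name16
```

Strict UTF-8 would reject `name8`, which is the whole point: the string is UTF-8 by convention, not by requirement.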

3 comments

Just to clarify further, note that actually preserving the "as if" behaviour does involve some complexity in the implementation. E.g. appending to WTF-8 has to be handled carefully to ensure it remains truly the same as doing so with WTF-16. This is because any newly paired surrogate has to be converted to its proper UTF-8 encoding. Similarly splitting WTF-8 can potentially break apart what was valid UTF-8 (though I'm not totally convinced that there's a good use case for actually doing this, at least for Windows paths).
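
A rough sketch of the append case, in Python. Everything here is illustrative: `to_wtf8` and `wtf8_concat` are hypothetical names, the conversion again assumes Python's `surrogatepass` handler, and the inputs are assumed to be well-formed WTF-8. The fix-up only triggers when the left side ends in a lone high surrogate and the right side starts with a lone low surrogate.

```python
def to_wtf8(units: bytes) -> bytes:
    """Possibly ill-formed UTF-16-LE -> WTF-8 (via surrogatepass)."""
    return units.decode("utf-16-le", "surrogatepass").encode("utf-8", "surrogatepass")


def ends_with_high(data: bytes) -> bool:
    # A 3-byte encoding of U+D800..U+DBFF is ED A0..AF xx.
    return len(data) >= 3 and data[-3] == 0xED and 0xA0 <= data[-2] <= 0xAF


def starts_with_low(data: bytes) -> bool:
    # A 3-byte encoding of U+DC00..U+DFFF is ED B0..BF xx.
    return len(data) >= 3 and data[0] == 0xED and 0xB0 <= data[1] <= 0xBF


def wtf8_concat(a: bytes, b: bytes) -> bytes:
    """Concatenate two WTF-8 strings, re-encoding a newly formed pair."""
    if ends_with_high(a) and starts_with_low(b):
        halves = (a[-3:] + b[:3]).decode("utf-8", "surrogatepass")
        # Round-trip through UTF-16 so the two halves fuse into one code point.
        joined = halves.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
        return a[:-3] + joined.encode("utf-8") + b[3:]
    return a + b


hi = b"\x3d\xd8"  # U+D83D, lone high surrogate, as UTF-16-LE
lo = b"\x00\xde"  # U+DE00, lone low surrogate, as UTF-16-LE

# Plain byte concatenation gives six bytes and is no longer valid WTF-8...
assert to_wtf8(hi) + to_wtf8(lo) != to_wtf8(hi + lo)
# ...while the careful version matches what WTF-16 concatenation produces.
assert wtf8_concat(to_wtf8(hi), to_wtf8(lo)) == to_wtf8(hi + lo)
```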

Of course the implementation details are something that can and should be handled by a library instead of doing it manually.

> On UNIX, paths are UTF-8 by convention, but not forced to be valid.

On UNIX, paths are a sequence of bytes, with two bytes being sacred to the kernel (0x2F, used to separate path elements, and 0x00, used to terminate paths) and no other bytes being interpreted in any way. Any character encoding which respects the sacred bytes by not using them to encode any other characters is therefore usable to make UNIX paths; in fact, a UNIX path can contain multiple encodings, as long as they're all suitably respectful.

That requirement for respect means that UTF-16 and UCS-2 and UCS-4 are not suitable. UTF-7 is, however, as is UTF-8, and all of the ISO/IEC 8859 encodings are as well, not to mention a whole raft of non-standard "extended ASCII" character sets. In theory, UTF-16 in some suitably respectful encoding would work, too, but gouge my eyes out with a goddamned spoon.
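
For anyone who wants to poke at this, a small Python sketch of the "respect" test. It is only a spot check on sample text, not a proof about a whole encoding, and `respects_sacred_bytes` is a name invented here.

```python
SACRED = {0x2F, 0x00}  # '/' and NUL, the two bytes the kernel interprets


def respects_sacred_bytes(text: str, encoding: str) -> bool:
    """True if no character other than '/' or NUL encodes to a sacred byte."""
    for ch in text:
        if ch in "/\0":
            continue  # these are allowed to map to the sacred bytes
        if SACRED & set(ch.encode(encoding)):
            return False
    return True


# ASCII-compatible encodings keep their hands off 0x2F and 0x00.
assert respects_sacred_bytes("héllo", "utf-8")
assert respects_sacred_bytes("héllo", "iso-8859-1")

# UTF-16 pads ASCII with NUL bytes, and U+012F encodes as 2F 01.
assert not respects_sacred_bytes("hello", "utf-16-le")
assert not respects_sacred_bytes("\u012f", "utf-16-le")
```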

They know all that. The operative words there are "by convention", not "by requirement".

I.e., their comment is an RFC "SHOULD", not an RFC "MUST".

My point is that assuming UNIX paths are UTF-8 by default is fragile, and that assuming UNIX paths have any consistent character encoding is also somewhat fragile. You can check to see if the path is UTF-8, but, otherwise, mitts off unless the user explicitly tells you an encoding.

> assuming UNIX paths are UTF-8 by default is fragile, and that assuming UNIX paths have any consistent character encoding is also somewhat fragile.

Again, we're both aware that it's not guaranteed. But the convention these days is nonetheless UTF-8.

> mitts off unless the user explicitly tells you an encoding.

This doesn't adequately solve the problem, though. Typed languages have to emit some type of value: having that be the "string" type of the language is useful, as you can do things like printf a message that includes the filename. (Or display an "Open file…" dialog. Or…)

There are middle-grounds, such as byte smuggling and escaping. But if you take the stance that filenames are arbitrary bags of bytes (which is actually a subset: the reality is even worse) — then anything that returns a filename is stuck: it can't return a string ("mitts off"). You can take Rust's approach with Path (a type specific to Paths) but people wail about that all the time too ("why are there so many types?"), and you can't print it because it's not text!
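
As one example of byte smuggling, here is a sketch using Python's `surrogateescape` error handler (PEP 383), which is what `os.fsdecode` and `os.fsencode` use on POSIX. The file name and the helper names `smuggle` and `unsmuggle` are invented for illustration.

```python
def smuggle(raw: bytes) -> str:
    """Decode a path as UTF-8, hiding undecodable bytes in lone surrogates."""
    return raw.decode("utf-8", errors="surrogateescape")


def unsmuggle(name: str) -> bytes:
    """Recover the exact original bytes from a smuggled string."""
    return name.encode("utf-8", errors="surrogateescape")


raw = b"report-\xff.txt"  # 0xFF can never appear in valid UTF-8
name = smuggle(raw)

# The bad byte rides along as U+DCFF, and the round trip is lossless.
assert name == "report-\udcff.txt"
assert unsmuggle(name) == raw

# The catch: the result is a str, but it is still not encodable as strict
# UTF-8, so writing it to a strict UTF-8 stream raises.
try:
    name.encode("utf-8")
    strictly_encodable = True
except UnicodeEncodeError:
    strictly_encodable = False
assert not strictly_encodable
```

So you do get a "string" back, but it only behaves like text until something tries to treat it as valid Unicode.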

"Error out if the file name isn't conventional" is a pragmatic tradeoff: bad file names will cause errors, but it makes basically all other operations much more tractable. It's not worth supporting insane file names.

There are workarounds, of course (such as just replacement-charactering "�" anything that can't be understood), and finding a format that can encode Paths when transmitting them, but these all take more time and effort. Allowing non-text file names introduces unnecessary complexity and bugs into every single program that needs to deal with the file system.
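
To show why the replacement-character workaround costs something, a tiny Python sketch (the file names and `display_name` are made up): the result is fine for display, but distinct names can collapse to the same string, so it can't be used to reopen the file.

```python
def display_name(raw: bytes) -> str:
    """Lossy, display-only decoding: invalid bytes become U+FFFD."""
    return raw.decode("utf-8", errors="replace")


a = b"data-\xff.bin"
b = b"data-\xfe.bin"

# Two different files on disk, one indistinguishable name on screen.
assert a != b
assert display_name(a) == display_name(b) == "data-\ufffd.bin"
```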

> There aren't any operating systems I'm aware of where paths are actually Shift-JIS by convention, so that seems like a non-issue. Using "UTF-8 by convention" strings works on all modern OSes.

Nonsense. Unix paths use the system locale by convention, and it's entirely normal for that to be Shift-JIS.