by StefanKarpinski
1285 days ago
On UNIX, paths are UTF-8 by convention, but not forced to be valid. Treating paths as UTF-8 works very well as long as you haven't also made the mistake of requiring your UTF-8 strings to be valid (which Python did, unfortunately).

On Windows, paths are UTF-16 by convention, but also not forced to be valid. However, invalid UTF-16 can be faithfully converted to WTF-8 and converted back losslessly, so you can translate Windows paths to WTF-8 and everything Just Works™ [1].

There aren't any operating systems I'm aware of where paths are actually Shift-JIS by convention, so that seems like a non-issue. Using "UTF-8 by convention" strings works on all modern OSes.

[1] Ok, here's why the WTF-8 thing works so well. If we write WTF-16 for potentially invalid UTF-16 (just arbitrary sequences of 16-bit code units), then the mapping between WTF-16 and WTF-8 space is a bijection because it's losslessly round-trippable. But more importantly, this WTF-8/16 bijection is also a homomorphism with respect to pretty much any string operation you can think of. For example `wtf8(utf16_concat(a, b)) == utf8_concat(wtf8(a), wtf8(b))` for arbitrary WTF-16 strings a and b. Similar identities hold for other string operations like searching for substrings or splitting on specific strings.
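To make the round-trip and concatenation claims concrete, here is a minimal Python sketch. The helper names `wtf16_to_wtf8` and `wtf8_to_wtf16` are made up for illustration, and leaning on Python's `surrogatepass` error handler is my assumption about a convenient way to get WTF-8-style bytes, not a full implementation of the WTF-8 spec. The sample inputs deliberately avoid splitting a surrogate pair across the concatenation boundary, a case the WTF-8 spec treats specially.

```python
def wtf16_to_wtf8(units: bytes) -> bytes:
    """Little-endian 16-bit code units (possibly ill-formed UTF-16) -> WTF-8-style bytes."""
    # surrogatepass lets unpaired surrogates through instead of raising.
    text = units.decode("utf-16-le", errors="surrogatepass")
    return text.encode("utf-8", errors="surrogatepass")


def wtf8_to_wtf16(data: bytes) -> bytes:
    """Inverse of wtf16_to_wtf8 for bytes produced by it."""
    text = data.decode("utf-8", errors="surrogatepass")
    return text.encode("utf-16-le", errors="surrogatepass")


# "a", an unpaired lead surrogate U+D800, "b": ill-formed UTF-16.
a = b"a\x00" + b"\x00\xd8" + b"b\x00"
# "c", U+1F600 as a proper surrogate pair (D83D DE00), "d": well-formed UTF-16.
b = b"c\x00" + b"\x3d\xd8\x00\xde" + b"d\x00"

# Lossless round trip, even for the ill-formed input.
assert wtf8_to_wtf16(wtf16_to_wtf8(a)) == a

# Converting then concatenating matches concatenating then converting.
assert wtf16_to_wtf8(a + b) == wtf16_to_wtf8(a) + wtf16_to_wtf8(b)
```

For well-formed input the result is just ordinary UTF-8, so only the unpaired surrogates get the unusual three-byte encoding.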
Of course, the implementation details can and should be handled by a library rather than done by hand.