Hacker News
by msla 1285 days ago
My point is that assuming UNIX paths are UTF-8 by default is fragile, and that assuming UNIX paths have any consistent character encoding is also somewhat fragile. You can check to see if the path is UTF-8, but, otherwise, mitts off unless the user explicitly tells you an encoding.

> assuming UNIX paths are UTF-8 by default is fragile, and that assuming UNIX paths have any consistent character encoding is also somewhat fragile.

Again, we're both aware that it's not guaranteed. But the convention these days is nonetheless UTF-8.

> mitts off unless the user explicitly tells you an encoding.

This doesn't adequately solve the problem, though. Typed languages have to emit some type of value: having that be the "string" type of the language is useful, as you can do things like printf a message that includes the filename. (Or display an "Open file…" dialog. Or…)

There are middle grounds, such as byte smuggling and escaping. But if you take the stance that filenames are arbitrary bags of bytes (which actually covers only part of it: the reality is even worse), then anything that returns a filename is stuck: it can't return a string ("mitts off"). You can take Rust's approach with Path (a type specific to paths), but people wail about that all the time too ("why are there so many types?"), and you can't print it because it's not text!
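
For what it's worth, here's roughly what byte smuggling looks like in Python, which does it through the `surrogateescape` error handler (PEP 383). This is a minimal sketch: the helper names `smuggle` and `unsmuggle` and the sample file name are made up for illustration.

```python
def smuggle(raw: bytes) -> str:
    """Decode a file name as UTF-8, hiding undecodable bytes in lone surrogates."""
    return raw.decode("utf-8", errors="surrogateescape")


def unsmuggle(name: str) -> bytes:
    """Reverse smuggle(): turn the lone surrogates back into the original bytes."""
    return name.encode("utf-8", errors="surrogateescape")


raw = b"report-\xff.txt"   # hypothetical name; 0xFF is never valid in UTF-8
name = smuggle(raw)        # an ordinary str, so it fits any API that wants text
roundtrip = unsmuggle(name)  # the original bytes survive unchanged

# The catch: the str is not strictly valid Unicode text, so a strict encode
# (which is what printing to a UTF-8 terminal amounts to) still fails.
try:
    name.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False
```

So you do get a "string", but one that can blow up later at the point where it has to become real text.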

"Error out if the file name isn't conventional" is a pragmatic tradeoff: bad file names will cause errors, but it makes basically all other operations much more tractable. It's not worth supporting insane file names.
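
A sketch of what I mean by erroring out, in Python. The function name `filename_to_text` and the exception type `UnconventionalFileName` are hypothetical, and "conventional" is assumed here to mean "valid UTF-8" and nothing more.

```python
class UnconventionalFileName(ValueError):
    """Hypothetical error for file names that are not valid UTF-8."""


def filename_to_text(raw: bytes) -> str:
    """Return the file name as text, or fail loudly if it is not UTF-8."""
    try:
        return raw.decode("utf-8")  # strict decoding: no smuggling, no replacement
    except UnicodeDecodeError as exc:
        raise UnconventionalFileName(
            f"file name is not valid UTF-8: {raw!r}"
        ) from exc
```

Everything downstream of that call gets to treat the name as plain text; the cost is that the odd file is rejected up front.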

There are workarounds, of course (such as substituting the replacement character "�" for anything that can't be decoded, or finding a format that can encode paths when transmitting them), but these all take more time and effort. Allowing non-text file names introduces unnecessary complexity and bugs into every single program that needs to deal with the file system.
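
To make the cost of those workarounds concrete, a rough Python sketch. The helper names are made up, and percent-encoding is just one transport format I'm assuming for illustration; nothing about it is special.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes


def lossy_display(raw: bytes) -> str:
    """Decode for display only; each undecodable byte becomes U+FFFD."""
    return raw.decode("utf-8", errors="replace")


def to_wire(raw: bytes) -> str:
    """Percent-encode the raw bytes so the name survives a text-only channel."""
    return quote_from_bytes(raw, safe="")


def from_wire(text: str) -> bytes:
    """Recover the exact original bytes from the percent-encoded form."""
    return unquote_to_bytes(text)


# Replacement is lossy: two distinct names become indistinguishable.
collision = lossy_display(b"a\xff") == lossy_display(b"a\xfe")

# The transport form is lossless, but it is one more representation to track.
wire = to_wire(b"report-\xff.txt")
restored = from_wire(wire)
```

The display form can't be used to reopen the file, and the wire form isn't something you'd show a user, so every program ends up juggling at least two representations of one name.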