by pdonis 2344 days ago
> the team's insistence that file names were byte strings was the cause of lots of bugs when it came to Unicode support

File names are a different problem because Windows and Unix treat them differently: Unix treats them as bytes and Windows treats them as Unicode. So there is no single data model that will work for any language.


The Rust standard library has a solution for this that actually works: On Unix-like systems, file paths are sequences of bytes, and most of the time the bytes are UTF-8. On Windows, they are WTF-8, so the API user sees a sequence of bytes, and most of the time they match UTF-8.

This means that there's more overhead on Windows. Still, it's much better to normalize what the application programmer sees across POSIX and NT, while round-tripping all paths on both, than to make the code unit size difference the application programmer's problem, as the C++ file system API does.

> On Windows, they are WTF-8

Seems like an apt acronym for Windows... :-)

On a more serious note, Python seems to have done something fairly similar with the pathlib standard library module.
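
To make the Python comparison concrete, here is a minimal sketch of the round-trip idea using the `surrogateescape` error handler (the one `os.fsdecode` and `os.fsencode` apply on Unix-like systems). The file name is made up for illustration, and the sketch only covers the Unix side where names are bytes; it is not a claim about how pathlib is implemented internally.

```python
# Hypothetical file name: Latin-1 "café.txt". The lone 0xE9 byte is not valid UTF-8.
raw = b"caf\xe9.txt"

# "surrogateescape" maps each undecodable byte 0xNN to the lone
# surrogate U+DC00 + 0xNN instead of raising an error.
name = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = name.encode("utf-8", errors="surrogateescape")

print(repr(name))       # 'caf\udce9.txt'
print(restored == raw)  # True
```

So the program sees a `str` that is ordinary text most of the time, and an odd name still survives the trip back to the OS unchanged.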

Not to mention case-sensitivity issues. Can you have two files, one named "FILE.txt" and the other "file.txt" in the same directory for instance?
On Windows? Of course you can.
I'm certain you can on Linux as well. Only the Mac's old HFS would not allow it.
Isn't this a fairly recent change?
NTFS has always been case sensitive; the Windows API just lets you treat it as case insensitive. If you pass `FILE_FLAG_POSIX_SEMANTICS` to `CreateFile`, you can make files that differ only in case.
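
If you'd rather test than argue, a rough probe for the "FILE.txt" vs "file.txt" question might look like the sketch below. The function name `directory_is_case_sensitive` is made up, and it only reports what the default file API does in one directory; it does not use `FILE_FLAG_POSIX_SEMANTICS` or any other platform-specific flag.

```python
import os
import tempfile

def directory_is_case_sensitive(parent=None):
    """Return True if names differing only in case are distinct files."""
    with tempfile.TemporaryDirectory(dir=parent) as tmp:
        # Create two names that differ only in case.
        for name in ("FILE.txt", "file.txt"):
            with open(os.path.join(tmp, name), "w") as fh:
                fh.write(name)
        # On a case-insensitive directory the second open reuses the
        # first file, so only one entry exists.
        return len(os.listdir(tmp)) == 2

print(directory_is_case_sensitive())
```

The answer can differ between directories on the same machine, which is why the probe takes an optional `parent` instead of assuming one answer per OS.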
Good luck using those in some tools which use the API differently though. Windows filenames are endless fun. What's the maximum length of the absolute path of a file? Why, that depends on which API you're using to access it!
Even worse on Unix, where it depends on the mount type. I haven't seen much proper long filename support in Unix apps or libs; it's much better in Windows land. Garbage in, garbage out is also a security nightmare, as names are no longer identifiable. You can easily spoof such names.
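
For the "depends on the mount type" part, the limits can be queried per path on Unix-like systems. A small sketch, assuming `os.pathconf` is available (it is not on Windows); the helper name `path_limits` is made up:

```python
import os

def path_limits(path="."):
    """Return the name and path length limits reported for `path`."""
    limits = {}
    if not hasattr(os, "pathconf"):
        return limits  # os.pathconf does not exist on Windows
    for key in ("PC_NAME_MAX", "PC_PATH_MAX"):
        # Only query names this platform knows about.
        if key in os.pathconf_names:
            try:
                limits[key] = os.pathconf(path, key)
            except OSError:
                pass  # the file system may not report this limit
    return limits

print(path_limits("."))
```

Running it against directories on different mounts shows whether the reported numbers actually differ on a given machine.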