by enriquto 1538 days ago
Indeed. The unix world would be a much happier place if the creat system call normalized the strings it receives to replace literal spaces with non-breaking spaces, and similar stuff. Regular users wouldn't notice, and it would simplify tons of shell scripts.

> if the creat system call normalized the strings it receives

Nowadays, the convention is to assume those bytes are strings encoded in UTF-8 or some ISO-8859 variant, but technically, the creat system call doesn’t receive strings; it receives byte arrays.
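To make the bytes-versus-strings point concrete, here is a minimal Python sketch. The name `caf\xe9` is just an illustrative choice; nothing here touches a real file system, it only shows that the same bytes can be read several ways and that the bytes themselves are the only thing the kernel sees.

```python
# One file name as raw bytes: "café" in ISO-8859-1, which is not valid UTF-8.
raw = b"caf\xe9"

# Interpreted as ISO-8859-1, every byte maps to a character.
as_latin1 = raw.decode("iso-8859-1")

# Interpreted strictly as UTF-8, the same bytes are rejected.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Python's "surrogateescape" handler maps each undecodable byte to a lone
# surrogate, so the original bytes survive a round trip unchanged.
as_escaped = raw.decode("utf-8", "surrogateescape")
round_tripped = as_escaped.encode("utf-8", "surrogateescape")
```

The round trip is the important part: a tool that insists on valid strings would have to refuse or rewrite this name, while a tool that treats names as bytes can pass it through untouched.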

Historically, that was the (somewhat) right decision because it meant file systems didn’t need to know much about character encodings (they only needed to know the byte value of ‘/’ and that zero is the name terminator), giving you a nice separation of concerns.
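As a rough sketch of how little a byte-oriented file system has to know, the check below validates a single path component using only those two byte values. The function name `is_valid_component` is made up for illustration and is not taken from any real kernel.

```python
SEPARATOR = 0x2F   # byte value of '/'
TERMINATOR = 0x00  # zero byte that ends a name

def is_valid_component(name: bytes) -> bool:
    """Accept any non-empty byte sequence that contains neither '/' nor a zero byte."""
    return len(name) > 0 and SEPARATOR not in name and TERMINATOR not in name
```

Under this rule `b"new file"` and `b"caf\xe9"` are both fine, whatever encoding the user had in mind; only `b"a/b"`, `b"a\x00b"` and the empty name are rejected.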

With Unicode, if you want to normalize names on write, or even only reject incorrectly normalized names, or have case-insensitive file names, your file system code needs to know a lot of Unicode. That can be problematic on small embedded systems.
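For a sense of what normalizing on write involves, this short Python sketch uses the standard `unicodedata` module to show one visible character, "é", ending up as two different byte sequences depending on the normalization form.

```python
import unicodedata

# The same visible character in the two common normalization forms.
composed = unicodedata.normalize("NFC", "\u00e9")    # single code point U+00E9
decomposed = unicodedata.normalize("NFD", "\u00e9")  # 'e' followed by U+0301

# Encoded as UTF-8, the two forms are different byte strings, so a file
# system that compares raw bytes treats them as different names.
composed_bytes = composed.encode("utf-8")      # b"\xc3\xa9"
decomposed_bytes = decomposed.encode("utf-8")  # b"e\xcc\x81"
```

Python can do this in one call because it ships the Unicode tables; a file system that wants the same behaviour has to carry equivalent data, which is the cost mentioned above for small embedded systems.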

I guess they could make it a compile flag.
And have tool 1 happily create files that, according to tool 2 (compiled with a different flag setting), cannot exist, or create multiple files in a directory that, according to tool 2, have the same name?
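A toy model of that failure mode, assuming a hypothetical directory that is just a set of raw name bytes. Both "tools" are invented for illustration: one stores names exactly as given, the other assumes every stored name is already NFC-normalized.

```python
import unicodedata

directory = set()  # toy directory: raw name bytes, compared byte for byte

def create_raw(name: str) -> None:
    """Tool 1 (flag off): stores the bytes exactly as given."""
    directory.add(name.encode("utf-8"))

def listing_for_normalizing_tool() -> list:
    """Tool 2 (flag on): reads every name as if it were NFC-normalized."""
    return sorted(
        unicodedata.normalize("NFC", raw.decode("utf-8")) for raw in directory
    )

create_raw("caf\u00e9")   # precomposed é
create_raw("cafe\u0301")  # 'e' plus combining acute accent

listing = listing_for_normalizing_tool()
```

The directory holds two distinct entries, yet tool 2's listing contains the same name twice, so it has no way to ask for one of them specifically.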

I know two sort-of examples of this. Firstly, there’s MS-DOS long file names. That’s a hack (in the positive sense of the word) that gives code that doesn’t know about long file names the 8.3 file names that it expects.

It works, but code that isn’t aware of long file names will only write 8.3 ones, so even a simple file copy using an old copy tool will drop long file names.

Secondly, there’s macOS. It has a minor thing with directory separators. The Unix layer thinks ‘/’ is the directory separator, old Mac code thinks it’s ‘:’. That works relatively well, mostly because old-style Mac code doesn’t use file paths in the UI.
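A minimal sketch of how two layers can disagree about the separator and still share names, assuming the translation between them is a plain character swap. The helper `swap_separators` is hypothetical and not an actual macOS API.

```python
# Translation table that exchanges the two separator characters.
_SWAP = str.maketrans({"/": ":", ":": "/"})

def swap_separators(name: str) -> str:
    """Convert a single file name between the two separator conventions."""
    return name.translate(_SWAP)
```

Because the swap is its own inverse, a name such as `Notes 1/2` on one side becomes `Notes 1:2` on the other and comes back unchanged, which is why neither layer ever sees its own separator inside a name.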

And of course, every desktop OS has to deal with it when it mounts various drives that handle file names differently. HFS+ used NFD normalization, most other file systems that normalize names use NFC, disks may be case preserving or not, case insensitive or not, normalization insensitive or not, etc.
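One way to picture those per-volume differences is as a lookup key that each file system computes before comparing names. The sketch below is an assumption-laden simplification: `lookup_key` is a made-up helper, and `str.casefold` stands in for whatever case-folding table a real file system uses.

```python
import unicodedata
from typing import Optional

def lookup_key(name: str, normalization: Optional[str], case_sensitive: bool) -> str:
    """Return the form of `name` that a volume with the given rules would compare."""
    key = name if normalization is None else unicodedata.normalize(normalization, name)
    if not case_sensitive:
        key = key.casefold()  # stand-in for a file system's own folding rules
    return key

a = "R\u00e9sum\u00e9.txt"    # precomposed, capital R
b = "re\u0301sume\u0301.txt"  # decomposed, lowercase r

# Whether a and b are "the same file" depends entirely on the volume's rules.
same_on_raw_volume = lookup_key(a, None, True) == lookup_key(b, None, True)
same_on_nfc_case_sensitive = lookup_key(a, "NFC", True) == lookup_key(b, "NFC", True)
same_on_nfd_case_insensitive = lookup_key(a, "NFD", False) == lookup_key(b, "NFD", False)
```

Copying a directory between volumes with different rules can therefore merge entries that were distinct on the source, or leave names that the destination's own tools will never produce.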

On Unix-like OSes a path /a/b/c/d/e/f can walk six different file systems, each with different rules (even without soft or hard links). It wouldn’t surprise me to find bugs in programs there.

well, you'd just be in the same situation we're already in: you might not find the file by typing the name. you'd have to support files created before this flag was toggled on anyway, so it's not like everyone would just start assuming all files follow the convention