by JoachimSchipper 3082 days ago
Elaborating on NUL bytes on the command line, e.g. find -print0 | xargs -0:

Using find -print0 etc. is a good idea not so much because NUL is an uncommon character (the various record separators, vertical tab, etc. are just as uncommon), but because UNIX, being a C system through and through, allows any byte to appear in a file name except '/' (the path separator) and NUL. Since find prints paths, which may themselves contain '/', NUL is the only byte that can never appear in its output. Thus, NUL makes a perfect separator between filenames.
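
To make that concrete, here is a minimal Python sketch of the NUL-terminated framing that find -print0 and xargs -0 rely on. The helper names join_nul and split_nul are made up for illustration; they are not part of find, xargs, or any library.

```python
def join_nul(paths):
    """Frame a list of byte paths as one NUL-terminated stream (the -print0 idea)."""
    for path in paths:
        # A path can never contain NUL, so refuse one rather than emit an
        # ambiguous stream.
        if b"\0" in path:
            raise ValueError("a path cannot contain a NUL byte")
    return b"".join(path + b"\0" for path in paths)


def split_nul(stream):
    """Recover the individual paths from a NUL-terminated stream (the -0 idea)."""
    parts = stream.split(b"\0")
    # Every record ends with NUL, so the split leaves one empty trailing chunk.
    if parts[-1] == b"":
        parts.pop()
    return parts
```

For example, split_nul(join_nul([b"a\nb", b"c"])) gives back the same two names, newline and all, because nothing legal in a path can be mistaken for the separator.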


Then it sounds like 'find' f'ed up if these names are not escaped properly when they are passed around (not saying this is the case). It's just like today with various charsets: wherever there is a charset boundary, say between bytes and C library strings, which is what this is, there has to be a charset conversion.
By default, find separates by newline; this is human-friendly, but breaks if an attacker/script/... puts a newline in the filename.
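
A rough Python sketch of that failure, assuming a POSIX filesystem that accepts a newline in a file name. The function newline_vs_nul and the sample names in the note below are invented for the demonstration, and the listing is done with os.listdir rather than by calling find itself.

```python
import os
import tempfile


def newline_vs_nul(names):
    """Create empty files with the given byte names, then frame the listing two ways."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_bytes = os.fsencode(tmp)
        for name in names:
            # Create an empty file; bytes paths avoid any charset assumptions.
            with open(os.path.join(tmp_bytes, name), "wb"):
                pass
        found = sorted(os.listdir(tmp_bytes))

    # Newline-terminated, like find's default output.
    newline_stream = b"".join(name + b"\n" for name in found)
    # NUL-terminated, like find -print0.
    nul_stream = b"".join(name + b"\0" for name in found)

    # Drop the empty chunk left after the final terminator.
    by_newline = newline_stream.split(b"\n")[:-1]
    by_nul = nul_stream.split(b"\0")[:-1]
    return found, by_newline, by_nul
```

Calling newline_vs_nul([b"notes.txt", b"report\nold.txt"]) creates two files, but the newline-split list has three entries (the second file is cut into b"report" and b"old.txt"), while the NUL-split list matches the directory listing exactly.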

The UNIX filesystem, qua filesystem, doesn't have a character set, just NUL-terminated strings. On the plus side, it's simple to handle, and means that retrofitting UTF-8 or another encoding is pretty easy. On the downside, two bytestrings that Unicode-canonicalize to the same value may name different files, which is surprising for humans.
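
A small Python illustration of the downside, using the standard unicodedata module. The name "café" and the helper same_after_nfc are just examples picked here; whether a given filesystem treats the two spellings as one file or two depends on the filesystem, and a plain bytestring filesystem as described above would see two.

```python
import unicodedata

# The same visible name spelled two ways:
composed = "caf\u00e9"      # 'é' as one precomposed code point (NFC form)
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent (NFD form)

composed_bytes = composed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")


def same_after_nfc(a, b):
    """True if two UTF-8 bytestrings are equal after Unicode NFC normalization."""
    norm_a = unicodedata.normalize("NFC", a.decode("utf-8"))
    norm_b = unicodedata.normalize("NFC", b.decode("utf-8"))
    return norm_a == norm_b
```

Here composed_bytes and decomposed_bytes differ (5 bytes versus 6), so a bytestring comparison sees two distinct names, yet same_after_nfc(composed_bytes, decomposed_bytes) is true.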

It's notable that many of early UNIX's competitors were much more full-fledged systems, featuring record-oriented files and typed data instead of UNIX's bytestrings-everywhere approach.