by saint_fiasco 3082 days ago
Why would you want to represent a null byte in a string? Is there a character encoding where the null value has a meaning?

Interestingly, some Unix command-line utilities will emit null-separated records if you pass a flag (often -0), because the null byte is the least likely character to show up as part of a string.

find and xargs are examples of programs with this feature.

It depends a bit on what you call a "string". If you're thinking "something a human will want to read", then yeah, there's not much need to encode null. If, however, you take the looser view of "a vector of 8-bit values", then encoding null becomes important. Otherwise your system can't be 8-bit clean.
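To make the 8-bit-clean point concrete, here is a minimal Python sketch. The helper c_style_view is a made-up name that just mimics what a NUL-terminated reader would see; it is not any real library's API.

```python
def c_style_view(buf: bytes) -> bytes:
    """Return what a NUL-terminated reader would see: the bytes before the first zero byte."""
    end = buf.find(b"\x00")
    return buf if end == -1 else buf[:end]

# Arbitrary binary data with a zero byte in the middle (illustrative values only).
payload = bytes([0x89, 0x50, 0x00, 0x4E, 0x47])
seen = c_style_view(payload)

print(len(payload), len(seen))  # 5 2
```

Everything after the zero byte is silently lost, which is the sense in which such a system is not 8-bit clean.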

Overall I think the null terminator has caused more problems than it has solved, but prefixing the string length isn't a panacea either. You end up with systems with 255, 65,535, or even 4,294,967,295 byte limits on their strings. It's also more difficult to pass around an index into the string, so you end up either making lots of copies and possibly merging them later, or cluttering your language with index values everywhere strings are used.
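A rough Python sketch of the length-prefix trade-off, assuming a one-byte prefix. pack_pascal and unpack_pascal are hypothetical helpers written for this comment, not an existing API. Embedded zero bytes survive, but the prefix can only count from 0 to 255.

```python
import struct

def pack_pascal(data: bytes) -> bytes:
    """Encode data with a one-byte length prefix (hypothetical helper)."""
    # "B" is an unsigned 8-bit integer, so struct rejects lengths above 255.
    return struct.pack("B", len(data)) + data

def unpack_pascal(buf: bytes) -> bytes:
    """Decode a one-byte length-prefixed string."""
    (length,) = struct.unpack_from("B", buf, 0)
    return buf[1:1 + length]

encoded = pack_pascal(b"a\x00b")
print(encoded)  # b'\x03a\x00b'

try:
    pack_pascal(b"x" * 256)
    too_long_rejected = False
except struct.error:
    too_long_rejected = True

print(too_long_rejected)  # True
```

A wider prefix only moves the ceiling, and every string pays for the extra bytes.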

It's quite possible that if K&R had gone with length-prefixed strings, we would have a different class of errors, where the string index gets offset or malicious values are inserted in the length field.
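As a hedged illustration of that kind of error, here is a Python sketch of a reader that refuses to trust the length field. The two-byte big-endian prefix and the name read_prefixed are assumptions made up for the example.

```python
import struct

def read_prefixed(buf: bytes) -> bytes:
    """Read a string with a 2-byte big-endian length prefix, checking the claimed length."""
    if len(buf) < 2:
        raise ValueError("buffer too short for length prefix")
    (claimed,) = struct.unpack_from(">H", buf, 0)
    # A forged prefix can claim more bytes than were actually sent.
    if claimed > len(buf) - 2:
        raise ValueError("length field exceeds available data")
    return buf[2:2 + claimed]

print(read_prefixed(b"\x00\x03abc"))  # b'abc'

try:
    read_prefixed(b"\xff\xffabc")  # claims 65535 bytes, carries 3
    forged_rejected = False
except ValueError:
    forged_rejected = True
```

In a language without bounds checking, skipping that comparison is where the out-of-bounds read would come from.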

Elaborating on the NUL bytes on the command-line, e.g. find -print0 | xargs -0:

Using find -print0 etc. is a good idea not so much because NUL is an uncommon character (the various record separators / vertical tab / ... are no more common), but because UNIX - being a C system through and through - allows any character to appear in a file name except '/' (path separator) and NUL. Thus, NUL makes a perfect separator between filenames.

Then it sounds like 'find' f'ed up if these things are not escaped properly when they are passed around (not saying this is the case). It's just like the various charsets today: wherever there is a charset boundary, and the boundary between raw bytes and C library strings is one, there has to be a charset conversion.

By default, find separates names with newlines; this is human-friendly, but breaks if an attacker, a script, or anything else puts a newline in a filename.
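A small Python sketch of the difference, using made-up filenames and a hypothetical split_records helper. It only simulates the two output formats as byte streams; it does not run find itself.

```python
def split_records(data: bytes, sep: bytes) -> list[bytes]:
    """Split a stream of separator-terminated records, dropping the empty tail."""
    parts = data.split(sep)
    if parts and parts[-1] == b"":
        parts.pop()
    return parts

# Two filenames, one of which contains a newline (legal on UNIX).
names = [b"report.txt", b"evil\nname.txt"]

newline_stream = b"".join(n + b"\n" for n in names)
nul_stream = b"".join(n + b"\x00" for n in names)

print(split_records(newline_stream, b"\n"))  # [b'report.txt', b'evil', b'name.txt']
print(split_records(nul_stream, b"\x00"))    # [b'report.txt', b'evil\nname.txt']
```

The newline-separated stream turns two names into three, while the NUL-separated one round-trips, because NUL cannot occur inside a filename.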

The UNIX filesystem, qua filesystem, doesn't have a character set, just NUL-terminated strings. On the plus side, it's simple to handle, and means that retrofitting UTF-8 or another encoding is pretty easy. On the downside, two bytestrings that Unicode-canonicalize to the same value may name different files, which is surprising for humans.
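A quick Python sketch of the canonicalization surprise, using "café" as an arbitrary example name. The two spellings render identically but encode to different bytes, so a byte-oriented filesystem would treat them as two distinct names.

```python
import unicodedata

composed = "caf\u00e9"      # é as a single code point, U+00E9
decomposed = "cafe\u0301"   # e followed by U+0301 COMBINING ACUTE ACCENT

composed_bytes = composed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")

# Different byte strings, hence different filenames at the byte level.
print(composed_bytes == decomposed_bytes)  # False

# Yet both normalize to the same canonical form.
same_after_nfc = (
    unicodedata.normalize("NFC", composed) == unicodedata.normalize("NFC", decomposed)
)
print(same_after_nfc)  # True
```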

It's notable that many of early UNIX's competitors were much more elaborate systems, featuring full-fledged record-oriented files and typed data instead of UNIX's bytestrings-everywhere approach.

Because you just read it from a file and don't want to corrupt it?