by pishpash 3082 days ago
Then it sounds like 'find' f'ed up if these things are not escaped properly when they are passed around (not saying this is the case). It's just like charsets today: wherever there is a charset boundary, say between raw bytes and C library strings, which is what this is, there has to be a charset conversion.

By default, find separates its results with newlines; this is human-friendly, but breaks if an attacker/script/... puts a newline in a filename.
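To make the failure concrete, here is a minimal Python sketch. It does not invoke find; it only simulates a newline-separated listing versus a NUL-separated one (the kind `find -print0` produces). The file names and the `split_names` helper are made up for illustration.

```python
def split_names(stream, sep):
    """Split a find-style listing on sep, dropping the empty trailing piece."""
    return [name for name in stream.split(sep) if name]

# Hypothetical file names; the first one contains an embedded newline.
names = [b"evil\nname.txt", b"plain.txt"]

# Simulated listings: one name per record, terminated by the separator.
newline_stream = b"".join(n + b"\n" for n in names)
nul_stream = b"".join(n + b"\0" for n in names)

by_newline = split_names(newline_stream, b"\n")  # the odd name is cut in two
by_nul = split_names(nul_stream, b"\0")          # names come back intact

print(len(names), len(by_newline), len(by_nul))
```

NUL works as a separator because it is one of the two bytes that can never appear inside a file name, so a consumer splitting on it cannot be fooled.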

The UNIX filesystem, qua filesystem, doesn't have a character set, just NUL-terminated byte strings (any byte except NUL and '/' may appear in a name). On the plus side, that's simple to handle, and it means that retrofitting UTF-8 or another encoding is pretty easy. On the downside, two bytestrings that Unicode-canonicalize to the same value may name different files, which is surprising for humans.
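A small Python sketch of that downside, assuming UTF-8 names. It only compares the byte strings and never touches a real filesystem, since some filesystems normalize names themselves and would hide the effect. The name "café" is just an example.

```python
import unicodedata

# Two spellings of the same visible name.
composed = unicodedata.normalize("NFC", "café")    # é as one code point, U+00E9
decomposed = unicodedata.normalize("NFD", "café")  # e followed by combining U+0301

# What a byte-oriented filesystem would actually store and compare.
name_a = composed.encode("utf-8")
name_b = decomposed.encode("utf-8")

# Byte-for-byte comparison: these are different names.
same_bytes = name_a == name_b

# Comparison after canonicalizing both to NFC: the same name to a human.
same_after_nfc = (
    unicodedata.normalize("NFC", name_a.decode("utf-8"))
    == unicodedata.normalize("NFC", name_b.decode("utf-8"))
)

print(name_a, name_b, same_bytes, same_after_nfc)
```

On a filesystem that compares raw bytes, both names could exist side by side in one directory and look identical in a listing.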

It's notable that many of early UNIX's competitors were much more full-fledged systems, featuring record-oriented files and typed data instead of UNIX's bytestrings-everywhere approach.