|
"A string contains characters. NUL is not a character; it's nothing." You already couldn't make this argument stick in the ASCII era, where a string can't contain NUL but can contain SOH (Start of Heading), STX (Start of Text), ETX (End of Text), EOT (End of Transmission), ENQ (Enquiry), ACK (Acknowledge), BEL, BS, HT (horizontal tab), LF, VT (vertical tab), FF (form feed), CR, SO (shift out), SI (shift in), DLE (data link escape), DC1, DC2, DC3, DC4 (device control 1-4), NAK (negative ACK), SYN (synchronous idle), ETB (end of transmission block), CAN (cancel), EM (end of medium), SUB (substitute), ESC (escape), FS (file separator), GS (group separator), RS (record separator), US (unit separator), and DEL, but Unicode makes that argument even sillier. Strings have always contained things that aren't "characters". The real problem is that no matter which in-band character you pick as the magical termination character, some strings will need to contain it, because in the general case strings can contain anything, and C is always asking you to pass them around as the general-purpose storage data structure. You can fix that with an escaping scheme, but now you have an escaped string, not just "a string". Since strings do indeed need to be able to carry NUL in the general case, you must either have some scheme for representing it, or expect a ton of errors when something jams the distinguished character into your string when you didn't expect it. (Note that for precisely the same reasons that NUL-termination isn't a good idea, there isn't any way to "filter" wrong NULs. You can't tell.) You might just barely be able to argue that the problem is that C's library mistook NUL-terminated strings for arbitrarily-sized arrays that can contain anything, but in C, if you want arbitrarily-sized arrays, you then have no choice but to pass the array size around to every call that expects such a thing.
The next immediately obvious thing to do is to pack the number together with the array in a struct, and lo, we're back to length-delimited strings. No matter how you slice it, C's got a major foundational screw-up in this area somewhere. If NUL-terminated strings are the bee's knees, C's APIs still took them in way too many places where they are not appropriate, and it caused decades of serious and often exploitable bugs. |
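To make the "pack the number together with the array in a struct" point concrete, here is a minimal sketch in C. The names `struct lstr`, `lstr_from`, `lstr_count_nul`, and `lstr_equal` are hypothetical, made up for illustration; they are not from any standard library, and a real implementation would also have to decide who owns the buffer.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical length-delimited string: the size travels with the bytes. */
struct lstr {
    size_t len;        /* number of payload bytes, embedded NULs included */
    const char *data;  /* not required to be NUL-terminated */
};

/* Wrap an existing buffer; the caller states the length explicitly. */
struct lstr lstr_from(const char *data, size_t len) {
    struct lstr s = { len, data };
    return s;
}

/* Count the NUL bytes carried inside the payload. */
size_t lstr_count_nul(struct lstr s) {
    size_t n = 0;
    for (size_t i = 0; i < s.len; i++) {
        if (s.data[i] == '\0') {
            n++;
        }
    }
    return n;
}

/* Byte-wise equality that does not stop at an embedded NUL. */
int lstr_equal(struct lstr a, struct lstr b) {
    return a.len == b.len && memcmp(a.data, b.data, a.len) == 0;
}
```

For the five-byte payload `"ab\0cd"`, `strlen` stops at the embedded NUL and reports 2, while `lstr_from(raw, 5)` keeps all five bytes and `lstr_count_nul` sees the NUL as ordinary data.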
The Unicode Standard (version 10.0, section 23.1 Control Codes) makes it clear that it "specifies semantics for the use" of only 9 of those ASCII control codes you mentioned, i.e. U+0009 to U+000D (HT, LF, VT, FF, CR) and U+001C to U+001F (FS, GS, RS, US). The rest of the 65 ASCII and Latin-1 control codes, except U+0085 (NEL), "constitute a higher-level protocol that is outside the scope of the Unicode Standard".
Particularly about NUL, it says: "U+0000 null may be used as a Unicode string terminator, as in the C language. Such usage is outside the scope of the Unicode Standard, which does not require any particular formal language representation of a string or any particular usage of null."
So Unicode makes that argument less silly.