by jerf 3082 days ago
"A string contains characters. NUL is not a character; it's nothing."

You already couldn't make this argument stick in the ASCII era, where a string can't contain NUL but can contain SOH (Start of Heading), STX (Start of Text), ETX (End of Text), EOT (End of Transmission), ENQ (Enquiry), ACK (Acknowledge), BEL, BS, HT (horizontal tab), LF, VT (vertical tab), FF (form feed), CR, SO (shift out), SI (shift in), DLE (data link escape), DC1, DC2, DC3, DC4 (device control 1-4), NAK (negative ACK), SYN (synchronous idle), ETB (end of transmission block), CAN (cancel), EM (end of medium), SUB (substitute), ESC (escape), FS (file separator), GS (group separator), RS (record separator), US (unit separator), and DEL, but Unicode makes that argument even sillier. Strings have always contained things that aren't "characters".

The real problem is that no matter which in-band character you pick as the magical terminator, some strings will need to contain it, because in the general case strings can contain anything, and C is always asking you to pass them around as the general-purpose data storage structure. You can fix that with an escaping scheme, but now you have an escaped string, not just "a string". Since strings do indeed need to be able to carry NUL in the general case, you must either have some scheme for representing it, or expect a ton of errors when something jams the distinguished character into your string when you didn't expect it. (Note that for precisely the same reasons that NUL-termination isn't a good idea, there isn't any way to "filter" wrong NULs. You can't tell.)
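To make the in-band problem concrete, here is a minimal sketch in C. The payload and the two helper names are made up for illustration; the only library call is strlen.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative 7-byte payload with a NUL in the middle. */
static const char payload[] = { 'a', 'b', 'c', '\0', 'd', 'e', 'f' };

/* Hypothetical helper: how many bytes the payload really occupies. */
size_t payload_size(void)
{
    return sizeof payload;
}

/* Hypothetical helper: how many bytes a NUL-terminated API believes it has.
   strlen stops at the first NUL, so everything after it is invisible. */
size_t payload_strlen(void)
{
    return strlen(payload);
}
```

The two helpers disagree about the same bytes, and nothing in the data says whether the NUL at index 3 is a terminator or part of the payload.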

You might just barely be able to argue the problem is that C's library mistook NUL-terminated strings for arbitrary-sized arrays that can contain anything, but in C if you want arbitrarily-sized arrays you would then have no choice but to pass the array size around to every call that expected such a thing. The next immediately obvious thing to do is to pack the number together with the array in a struct, and lo, we're back to length-delimited strings.
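A rough sketch of what "pack the number together with the array" could look like. The struct and function names are hypothetical, not from any particular library; only memcmp comes from the standard library.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical type: a borrowed view of some bytes plus their count. */
struct byte_str {
    const char *data;
    size_t len;
};

/* Wrap an existing buffer; no copy is made and no terminator is needed. */
struct byte_str byte_str_from(const char *data, size_t len)
{
    struct byte_str s = { data, len };
    return s;
}

/* Equality looks at all len bytes, so embedded NULs take part in the
   comparison instead of cutting it short. */
int byte_str_equal(struct byte_str a, struct byte_str b)
{
    if (a.len != b.len)
        return 0;
    return a.len == 0 || memcmp(a.data, b.data, a.len) == 0;
}
```

With the buffers { 'a', 0, 'b' } and { 'a', 0, 'c' }, strcmp reports them equal because it stops at the NUL, while byte_str_equal on the full three bytes tells them apart.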

No matter how you slice it, C's got a major foundational screw-up in this area somewhere. Even if NUL-terminated strings are the bee's knees, C's APIs still took them in way too many places where they are not appropriate, and it caused decades of serious and often exploitable bugs.

3 comments

> but Unicode makes that argument even sillier

The Unicode Standard (version 10.0, section 23.1 Control Codes) makes it clear that it "specifies semantics for the use" of only 9 of those ASCII control codes you mentioned, i.e. U+0009 to U+000D (HT, LF, VT, FF, CR) and U+001C to U+001F (FS, GS, RS, US). The rest of the 65 ASCII and Latin-1 control codes, except U+0085 (NEL), "constitute a higher-level protocol that is outside the scope of the Unicode Standard".

Particularly about NUL, it says: "U+0000 null may be used as a Unicode string terminator, as in the C language. Such usage is outside the scope of the Unicode Standard, which does not require any particular formal language representation of a string or any particular usage of null."

So Unicode makes that argument less silly.

NUL is a character in the ASCII character set. That is a problem because you cannot create all the strings composed of ASCII characters in C.

But C never claimed to support all ASCII strings. C doesn't even have strings. C just has char arrays, which are byte arrays. When strings were formalized by convention in the stdlibs, the supported strings were clearly those over the values 1-255, NUL excluded. That's the character set available for strings in the stdlibs. If you insist on using stdlib strings for some other kind of strings, that's your own problem.

"But C never claimed to support all ASCII strings."

That is precisely my point... there is no well-supported solution in core C for arbitrary binary strings, despite C's extremely frequent use in domains that require them. If you insist on using stdlib strings for other kinds of strings, you do have a problem... but you also have no other choice. Which brings it back to being a language/library problem.

As I already alluded to, C itself doesn't have a problem with length-delimited strings, and there are plenty of libraries you can get for them. But the core library for C does force this problem in your face by leaving you no other choice, and it is a valid criticism of C.

(C is such a disaster that the only thing to do is to leave it behind as quickly as possible. However, if we were somehow stuck with the language itself, there are a lot of ways we could improve the libraries it comes with, as again demonstrated by the many such improved libraries you can get. However, one of the things I've learned from learning a ton of languages over the past couple of decades is that a language almost never manages to escape from its own standard library, and the few that manage it (like D) pay a stiff adoption price in the process. C's standard library has a real problem here that has caused real bugs, and no amount of wordplay is going to fix those decades of bugs.)

Good point. And I would agree that the error lies in choosing to use strings for inappropriate places.

Also, ETX might have been a good terminator :) I assume NUL was chosen because the "keep going" check is easier: if (c) ... vs. if (c != 0x03) ...
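For what it's worth, the two checks could be sketched like this. Both function names and the ETX macro are made up for illustration, and the ETX variant assumes the caller guarantees a 0x03 byte is present.

```c
#include <stddef.h>

#define ETX 0x03  /* ASCII End of Text, used here as a hypothetical terminator */

/* Length of a NUL-terminated sequence: the loop condition is the byte itself. */
size_t len_nul(const char *p)
{
    size_t n = 0;
    while (p[n])
        n++;
    return n;
}

/* Length of an ETX-terminated sequence: needs an explicit comparison. */
size_t len_etx(const char *p)
{
    size_t n = 0;
    while (p[n] != ETX)
        n++;
    return n;
}
```

Either way the scan is linear, and the ETX version just moves the in-band problem from byte 0x00 to byte 0x03.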

But my argument was against length prefixing somehow being "more fundamental" than having just a sequence of characters "raw" in memory addresses.