Hacker News
by function_seven 3082 days ago
A string contains characters. NUL is not a character; it's nothing.

"Fundamental" in this case means "matches reality". Having a number at the beginning doesn't match reality as closely as having the string of characters in sequential memory addresses with something to terminate them.

The quick fox made the jump\N

or

27The quick fox made the jump

The second one requires more work to store (a character-counting routine), and needs even more work to handle variable-length strings that may exceed 255-ish bytes/characters, the most a single length byte can describe.
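As a rough illustration of the two layouts above, here is a minimal C sketch. The helper names `store_terminated` and `store_prefixed` are made up for this example, and the one-byte prefix is an assumption taken from the "255-ish" remark; real length-prefixed formats often use wider counts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: store `text` followed by a NUL terminator.
   Returns 0 on success, -1 if the output buffer is too small. */
int store_terminated(const char *text, char *out, size_t out_size)
{
    assert(text != NULL && out != NULL);
    size_t n = strlen(text);
    if (n + 1 > out_size)
        return -1;
    memcpy(out, text, n + 1);      /* the +1 copies the terminator too */
    return 0;
}

/* Hypothetical helper: store `text` with a one-byte length prefix.
   Returns 0 on success, -1 if the text needs more than one length
   byte or does not fit in the output buffer. */
int store_prefixed(const char *text, uint8_t *out, size_t out_size)
{
    assert(text != NULL && out != NULL);
    size_t n = strlen(text);       /* the character-counting pass */
    if (n > UINT8_MAX || n + 1 > out_size)
        return -1;                 /* one byte cannot count past 255 */
    out[0] = (uint8_t)n;           /* length first */
    memcpy(out + 1, text, n);      /* then the characters, no terminator */
    return 0;
}
```

With the example sentence, `store_prefixed` writes 27 into the first byte, and it refuses any input longer than 255 bytes, which is the extra case the prefixed form has to handle somehow.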

I'm not discounting the benefits of prefixing the length, just saying it's not more fundamental than null-terminating an arbitrary sequence of characters.

3 comments

"A string contains characters. NUL is not a character; it's nothing."

You already couldn't make this argument stick in the ASCII era, where a string can't contain NUL but can contain SOH (Start of Heading), STX (Start of Text), ETX (End of Text), EOT (End of Transmission), ENQ (Enquiry), ACK (Acknowledge), BEL, BS, HT (horizontal tab), LF, VT (vertical tab), FF (form feed), CR, SO (shift out), SI (shift in), DLE (data link escape), DC1, DC2, DC3, DC4 (device control 1-4), NAK (negative ACK), SYN (synchronous idle), ETB (end of transmission block), CAN (cancel), EM (end of medium), SUB (substitute), ESC (escape), FS (file separator), GS (group separator), RS (record separator), US (unit separator), and DEL, but Unicode makes that argument even sillier. Strings have always contained things that aren't "characters".

The real problem is that no matter which in-band character you pick as the magical terminator, some strings will need to contain it. In the general case strings can contain anything, because C is always asking you to pass them around as the general-purpose storage data structure. You can fix that with an escaping scheme, but now you have an escaped string, not just "a string". Since strings do indeed need to be able to carry NUL in the general case, you must either have some scheme for representing it or expect a ton of errors when something jams the distinguished character into your string when you didn't expect it. (Note that for precisely the same reasons that NUL-termination isn't a good idea, there isn't any way to "filter" wrong NULs. You can't tell.)
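A small C sketch of that failure mode, assuming a buffer whose real size is known separately. The function name `visible_to_cstring_api` is invented for this example; it reports how many bytes a NUL-terminated API would actually see.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: given `size` bytes of data, return how many of
   them code that stops at the first NUL would process. */
size_t visible_to_cstring_api(const char *data, size_t size)
{
    assert(data != NULL);
    /* memchr returns a pointer to the first NUL inside the buffer,
       or NULL if the buffer contains none. */
    const char *nul = memchr(data, '\0', size);
    return nul ? (size_t)(nul - data) : size;
}
```

For the seven bytes `a b c NUL d e f` this returns 3: everything after the embedded NUL is silently dropped, and nothing in the bytes themselves says whether that NUL was meant as data or as the end.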

You might just barely be able to argue the problem is that C's library mistook NUL-terminated strings for arbitrary-sized arrays that can contain anything, but in C if you want arbitrarily-sized arrays you would then have no choice but to pass the array size around to every call that expected such a thing. The next immediately obvious thing to do is to pack the number together with the array in a struct, and lo, we're back to length-delimited strings.
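The "pack the number together with the array" step might look like the following C sketch. The type `byte_slice` and its helpers are hypothetical names for illustration, not part of any standard library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical type: a pointer and a length travelling together. */
struct byte_slice {
    const unsigned char *data;
    size_t len;
};

/* Wrap an existing buffer; the caller supplies the true length. */
struct byte_slice slice_from(const void *data, size_t len)
{
    assert(data != NULL || len == 0);
    struct byte_slice s = { data, len };
    return s;
}

/* Compare all `len` bytes, embedded NULs included. Returns 1 if equal. */
int slice_equal(struct byte_slice a, struct byte_slice b)
{
    return a.len == b.len &&
           (a.len == 0 || memcmp(a.data, b.data, a.len) == 0);
}
```

Two five-byte buffers that differ only after an embedded NUL compare as different with `slice_equal`, whereas `strcmp` on the same buffers stops at the NUL and reports them equal.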

No matter how you slice it, C's got a major foundational screw-up in this area somewhere. If NUL-terminated strings are the bee's knees, C's APIs still took them in way too many places where they are not appropriate, and it caused decades of serious and often exploitable bugs.

> but Unicode makes that argument even sillier

The Unicode Standard (version 10.0, section 23.1 Control Codes) makes it clear that it "specifies semantics for the use" of only 9 of those ASCII control codes you mentioned, i.e. U+0009 to U+000D (HT, LF, VT, FF, CR) and U+001C to U+001F (FS, GS, RS, US). The rest of the 65 ASCII and Latin-1 control codes, except U+0085 (NEL), "constitute a higher-level protocol that is outside the scope of the Unicode Standard".

Particularly about NUL, it says: "U+0000 null may be used as a Unicode string terminator, as in the C language. Such usage is outside the scope of the Unicode Standard, which does not require any particular formal language representation of a string or any particular usage of null."

So Unicode makes that argument less silly.

NUL is a character in the ASCII character set. That is a problem because you cannot create all the strings composed of ASCII characters in C.

But C never claimed to support all ASCII strings. C doesn't even have strings. C just has char arrays, which are byte arrays. When strings were formalized by convention in the stdlibs, clearly the supported strings are sequences of byte values 1-255, NUL excluded. That's the character set available for strings in the stdlibs. If you insist on using stdlib strings for some other kind of strings, that's your own problem.

"But C never claimed to support all ASCII strings."

That is precisely my point... there is no well-supported solution in core C for arbitrary binary strings, despite C's extremely frequent use in domains that require them. If you insist on using stdlib strings for other kinds of strings, you do have a problem... but you also have no other choice. Which brings it back to being a language/library problem.

As I already alluded to, C itself doesn't have a problem with length-delimited strings, and there are plenty of libraries you can get for them. But the core library for C does force this problem in your face by leaving you no other choice, and it is a valid criticism of C.

(C is such a disaster that the only thing to do is to leave it behind as quickly as possible. However, if we were somehow stuck with the language itself, there are a lot of ways we could improve the libraries it comes with, as again demonstrated by the many such improved libraries you can get. However, one of the things I've learned from learning a ton of languages over the past couple of decades is that a language almost never manages to escape from its own standard library, and the few that manage it (like D) pay a stiff adoption price in the process. C's standard library has a real problem here, one that has caused real bugs, and no amount of wordplay is going to fix those decades of bugs.)

Good point. And I would agree that the error lies in choosing to use strings in inappropriate places.

Also, ETX might have been a good terminator :) I assume NUL was chosen for easier checking (if (char) ...) vs (if (char == 0x03) ...)
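To make the two checks concrete, here is a hedged C sketch of scanning to each kind of terminator. The function names are made up for this example, and the ETX-terminated convention is purely hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define ETX 0x03   /* ASCII End of Text */

/* Count characters up to a NUL terminator. */
size_t length_until_nul(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n])            /* implicit "is it non-zero?" test */
        n++;
    return n;
}

/* Count characters up to an ETX terminator (hypothetical convention).
   The caller must guarantee that an ETX byte is present. */
size_t length_until_etx(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != ETX)     /* explicit comparison against a constant */
        n++;
    return n;
}
```

Both loops do the same job; the only difference in the source is the explicit comparison, and whether that costs anything depends on the compiler and the CPU.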

But my argument was against length prefixing somehow being "more fundamental" than having just a sequence of characters "raw" in memory addresses.

None of it really "matches reality." It's all binary numbers, and on a deeper level, voltages or magnetized particles.

0 is not a letter of the alphabet, but nor is 01000001 (ASCII 'A').

So either the first number is special, or you look for a special number to indicate the end. Neither represents reality, because the "end" of a single group of characters is visually identical to a million white-space characters that happen to fit into the emptiness that follows.

My point being, it's probably not helpful to argue which "matches reality" when they're both just abstract representations of concepts.

I was going more toward "closer to reality". But I take your point. Somewhere we're going to need extra info about the string itself, whether that extra info is a magic terminator or a magic prefix. The magic prefix gives great benefit, but also is more complex to implement if you want to store an arbitrary-length string.

Most CPUs have a flags register, typically with a "zero" flag that is set when the result of the last operation was zero. Zero is special in the vast majority of hardware designs. Checking for null (zero) instead of another specific value often saves a few cycles. That's where the optimization of having all FOR loops count down towards zero comes from: the check saves a cycle or two each iteration on some CPUs. The same thing happens when reading from a buffer: on architectures where loads update the flags, the load instruction sets the Zero flag when the terminating null is read.
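For reference, the count-down idiom looks like this in C. It is only a sketch with invented function names, and whether it saves anything depends entirely on the compiler and target; an optimizing compiler may well generate the same code for both loops.

```c
#include <assert.h>
#include <stddef.h>

/* Sum a buffer counting up: the loop test compares i against n. */
unsigned sum_count_up(const unsigned char *buf, size_t n)
{
    assert(buf != NULL || n == 0);
    unsigned total = 0;
    for (size_t i = 0; i < n; i++)
        total += buf[i];
    return total;
}

/* Sum a buffer counting down: the loop test compares n against zero,
   which some CPUs get for free from the decrement's zero flag. */
unsigned sum_count_down(const unsigned char *buf, size_t n)
{
    assert(buf != NULL || n == 0);
    unsigned total = 0;
    while (n != 0) {
        n--;
        total += buf[n];
    }
    return total;
}
```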

The difference doesn't matter much on modern (non-embedded) processors, but it did make sense at the time C was designed. It matches the most common hardware design pattern better than the alternatives.

> NUL is not a character

Somehow NUL is still an assigned character in the ASCII code table. Strange, hmmm?

Ha, the fact that I spelled it "NUL" instead of using the word "null" should have made me pause. :)

Ok, it's an ASCII character code point. One that's used to terminate strings. I meant it's not a character you'd find in the middle of a string, though I realize that's kinda tautological. Back when ASCII was developed, punch cards were used. Any row in the card that wasn't punched was a NUL. It wouldn't have made sense to have it in the middle of a string. It would be like missing a character altogether.