The Most Expensive One-byte Mistake (2011)

	The Most Expensive One-byte Mistake (2011) (queue.acm.org)
	8 points by eric59 4282 days ago

3 comments

DanBC 4282 days ago

lysium 4282 days ago

I find this article very interesting, but I'd still like to point out it is from 2011.

kazinator 4282 days ago

Betteridge's Law strikes again.

No, null-terminated strings are fine. Rants against null-terminated strings are a good way to spot nutjobs.

Null-terminated strings have virtues, such as being recursively defined: the tail of a string is a string. So strchr could be written like this (let's drop the const for simplicity):

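   char *strchr(char *s, int ch)
   {
     /* Test for a match first, so that searching for 0 finds the terminator.
        Like the standard function, compare against ch converted to char. */
     if (*s == (char) ch)
       return s;

     /* End of string reached without a match. */
     if (*s == 0)
       return NULL;

     /* Otherwise search the tail, which is itself a string. */
     return strchr(s + 1, ch);
   }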

It's easy to break a string with delimiters into the individual pieces in place, just by writing nulls over the separating characters, and keeping a vector of pointers to the pieces. This can't be done with some other string representations like length + data.
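
A minimal sketch of that in-place split, assuming a single-character delimiter and a caller-supplied pointer array. The name split_in_place and its parameters are made up for illustration and do not come from any library:

```c
#include <stddef.h>

/* Split buf in place on the delimiter character delim.
 * Each delimiter is overwritten with '\0' and a pointer to the start of
 * every piece is stored in pieces[]. At most max pieces are recorded; if
 * the array fills up, the last piece keeps the unsplit remainder.
 * Returns the number of pieces stored. */
size_t split_in_place(char *buf, char delim, char **pieces, size_t max)
{
    size_t n = 0;

    if (max == 0)
        return 0;

    pieces[n++] = buf;              /* the first piece starts at the buffer */

    for (char *p = buf; *p != '\0'; p++) {
        if (*p == delim) {
            if (n == max)           /* no room left: stop splitting */
                break;
            *p = '\0';              /* terminate the previous piece */
            pieces[n++] = p + 1;    /* the next piece starts after it */
        }
    }
    return n;
}
```

Called on a writable buffer holding "usr:local:bin" with ':' as the delimiter, this leaves three ordinary null-terminated strings inside the original storage. The buffer must be writable, so a string literal will not do.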

When one null terminated string is a suffix of another (and ideally both are treated as immutable), then they can share storage.

  char *excon = "excon", *con = excon + 2;

Catenating null-terminated strings is efficient if you keep a tail pointer. A repeated strcat-like operation will be O(N*N), of course, so don't do that in some critical inner loop with a large amount of text.
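
One way the tail-pointer idea might look, as a sketch only. The helper append_at and its bounds parameter end are hypothetical names, and silently truncating on overflow is an assumption made here to keep the example short:

```c
/* Copy src to the position tail and return a pointer to the new
 * terminating null, so the next append can start there without rescanning
 * the destination. end points one past the last byte of the destination
 * buffer; the copy is truncated if it would not fit. Assumes tail < end. */
char *append_at(char *tail, const char *end, const char *src)
{
    while (*src != '\0' && tail + 1 < end)
        *tail++ = *src++;
    *tail = '\0';
    return tail;
}
```

Starting with tail pointing at the beginning of the buffer and feeding each call's return value into the next, every append costs time proportional to the piece being added rather than to the text accumulated so far.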

Length + data strings have various disadvantages. For one thing, how wide should be the length? Two bytes? Four? Eight? If you make it two bytes today and store binary data somewhere, it will be incompatible with tomorrow's four byte length. And then there is endianness. A binary file with strings produced on one machine will have byte swapped lengths on another. Null terminated strings can be blasted over a serial line or network, or written to disks, as they are; they are already marshaled and ready to go! (Though I must hastily acknowledge that this isn't true of wide character null term'd strings, of course.)

Dynamic strings (management record plus pointer to data) are heavyweight representations that will show their weaknesses at virtual machine boundaries. You cannot pass them between address spaces or share them without marshaling to some flat form and back.

Also, regarding another point in the article, MS-DOS did not invent the backslash as a path separator instead of the slash. This is a common misconception. MS-DOS supports both forward and backslash as separators! And so does Windows (every version between then and now). Early versions of COMMAND.COM had a variable whereby you could set this as a preference: whether you want to display and input path separators as slash or as backslash. This was later removed.

Today, when you have trouble with forward slashes in Windows, this is due to the application you are using (including, sadly, that application known as the Windows Explorer, and its "Shell API"). The underlying kernel handles the slashes just fine.

andreasvc 4282 days ago

I don't find the advantages of null-terminated strings you cite compelling. They sound like relatively rare operations, and compared to the severity and quantity of security problems caused by null-terminated strings, it seems like a particularly bad trade-off. An advantage of storing strings with a length, and optionally a start position, is that arbitrary slices can be defined which reuse the underlying data; this covers your examples of splitting by a delimiter and reusing suffixes. The argument that serialization would be harder is not really particular to the string format in question: serialization is always something that needs to be defined well, preferably with static typing. Encoding issues make file formats complicated either way; if anything, storing a length attribute encourages you to also store an encoding attribute while you're at it.

hyperliner 4282 days ago

Bingo: "Length + data strings have various disadvantages. For one thing, how wide should be the length? Two bytes? Four? Eight?"

andreasvc 4282 days ago

That's easy: it should be of type size_t. Yes, that is wasteful for short strings, but then again I believe null-terminated strings are the most widespread and worst case of premature optimization, and correctness & safety should take precedence.

sjolsen 4282 days ago

Edit: It occurred to me after posting this that by "length + data" you might mean actually storing the length with the string data. If so, you're right, that's stupid. The correct solution (well, it's better than C/Pascal strings, anyway) is to store just the string data in the string proper, and use references that consist of pointer+length or pointer+pointer pairs instead of single char-pointers.

> It's easy to break a string with delimiters into the individual pieces in place, just by writing nulls over the separating characters

This assumes you're free to modify the input. If you don't want/are unable to modify the input, you're forced to allocate memory to hold the output; you've totally unnecessarily doubled the space requirement. It also assumes that there is a separating character to overwrite, which may not be the case, causing the same problem.

> just by writing nulls over the separating characters

> This can't be done with some other string representations like length + data

Obviously. It is, however, perfectly possible to tokenize a string into a vector of length+data pairs without mangling the input.
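
For illustration, tokenizing into pointer+length pairs could be sketched as below. struct slice and split_slices are hypothetical names invented for this example, and dropping pieces beyond max is an arbitrary choice:

```c
#include <stddef.h>

/* A hypothetical pointer+length reference into bytes owned elsewhere. */
struct slice {
    const char *data;
    size_t len;
};

/* Record each delim-separated piece of the first len bytes of s as a slice.
 * The input is only read, never written. At most max slices are stored and
 * any further pieces are dropped. Returns the number of slices stored. */
size_t split_slices(const char *s, size_t len, char delim,
                    struct slice *out, size_t max)
{
    size_t n = 0, start = 0;

    for (size_t i = 0; i <= len && n < max; i++) {
        if (i == len || s[i] == delim) {   /* end of input also ends a piece */
            out[n].data = s + start;
            out[n].len = i - start;
            n++;
            start = i + 1;
        }
    }
    return n;
}
```

Because the input is taken as const char *, this works on a string literal or a read-only mapping. The pieces are not null-terminated, so they have to be handled with length-aware calls such as memcmp or printf's "%.*s".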

> When one null terminated string is a suffix of another (and ideally both are treated as immutable), then they can share storage.

And only when one is a suffix of another. If you want to share storage between strings between which this relationship does not hold, you must again allocate extra memory, and again this is not necessary if you use a data+length representation.
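
A small sketch of sharing storage when neither string is a suffix of the other, again using a hypothetical struct slice; subslice is an illustrative name, not an existing API:

```c
#include <stddef.h>

/* A hypothetical pointer+length reference into bytes owned elsewhere. */
struct slice {
    const char *data;
    size_t len;
};

/* Return the slice of s covering bytes [from, to), clamped to s.
 * No bytes are copied; the result points into the same storage. */
struct slice subslice(struct slice s, size_t from, size_t to)
{
    if (to > s.len)
        to = s.len;
    if (from > to)
        from = to;

    struct slice r = { s.data + from, to - from };
    return r;
}
```

Given a slice over "excon", subslice(whole, 0, 2) refers to "ex" and subslice(whole, 1, 4) refers to "xco", both inside the original five bytes. Neither could be a null-terminated string sharing that storage, since each ends before the terminator.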

> Catenating null-terminated strings is efficient if you keep a tail pointer

In other words, if you use a pointer+length or arithmetically equivalent representation—except that you don't even bother binding the data into a coherent data structure, and instead force any code without access to your local tail-pointer variable to derive the length/end of the string on its own.

> For one thing, how wide should be the length?

Wide enough to represent the length of the longest possible string. Generally, this is no wider than a pointer.

> If you make it two bytes today and store binary data somewhere, it will be incompatible with tomorrow's four byte length

> Null terminated strings can be blasted over a serial line or network, or written to disks, as they are

You are making the incorrect assumption that the most appropriate on-disk/serialization representation is necessarily the same as the most appropriate in-memory representation.
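
To make the distinction concrete, here is a sketch of an external form that is independent of the in-memory one. The wire format (a 4-byte big-endian length followed by the bytes) and the name encode_string are assumptions chosen for this example, not something proposed in the thread:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Encode len bytes at data as a 4-byte big-endian length followed by the
 * bytes themselves. Returns the number of bytes written to out, or 0 if
 * the string is too long for the prefix or does not fit in cap bytes. */
size_t encode_string(const char *data, size_t len,
                     unsigned char *out, size_t cap)
{
    if (len > UINT32_MAX || cap < 4 || len > cap - 4)
        return 0;

    uint32_t n = (uint32_t)len;

    /* Write the length byte by byte so the host's endianness never matters. */
    out[0] = (unsigned char)(n >> 24);
    out[1] = (unsigned char)(n >> 16);
    out[2] = (unsigned char)(n >> 8);
    out[3] = (unsigned char)n;
    memcpy(out + 4, data, len);
    return 4 + (size_t)len;
}
```

The width and byte order are fixed by the format rather than by whatever size_t happens to be on the machine doing the writing, so the in-memory length can grow later without changing files already on disk.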

> And then there is endianness

1. See above.

2. You're right that "this isn't true of wide character null term'd strings," so unless you and everyone you communicate with are still living in 1985, NTBSes don't even have that advantage.

> You cannot pass them between address spaces or share them without marshaling to some flat form and back

You can't do this with NTBSes, either, unless you have a way of translating the address of the string across address spaces, in which case you could just as well pass a pointer+length as you could pass a pointer.

NTBSes are a perfectly fine representation if: your string is known at compile time; your string will never be modified; your string will only be scanned through from front to back; and your string does not contain a null character. From a technical standpoint, the only reason to bother with them in the first place is that you have extremely tight memory requirements and can't even afford the extra byte or three needed for a sane string-reference representation.

If any of these conditions does not hold, NTBSes are a Terrible, Horrible, No Good, Very Bad representation. All they accomplish in practice is to make strlen a linear-time operation, defeat a great number of data-sharing opportunities, and generally plague the world with buffer overruns.

kazinator 4281 days ago

What I mean by shared memory is that the null-terminated string is self-describing. If two processes attach a region of shared memory (at a different virtual address in each process), one process can put a null-terminated string into that memory, and the other process sees a string object. No length or pointer value has to be communicated, just the bytes of the self-describing string object itself.

This is not the case with your proposed string representation.

I never claimed that there is a best string representation, or that the same representation should be used externally and internally. (I can prove that I don't believe that by pointing to programs I have written which don't do that; for instance my TXR language, written in C, has garbage collected strings that are arrays of wide characters internally, all I/O is done using UTF-8, and most string operations are applicative rather than destructive. Yet, those underlying character arrays are null terminated, for pragmatic reasons of environmental interoperability.)

The "fat reference" representation of strings is interesting, but has the disadvantage of splitting the string into two (or more) objects: the data, which is in one location, and meta-data which is in another location. This is fine for internal representations, but creates obstacles if we do want to use this as an external representation. Yet, by itself, it is not a sufficiently robust internal representation that it warrants outrageous claims of superior safety: after all, the underlying storage is a simple C-like array which doesn't know how long it is. It is not null terminated strings which cause buffer overflows, but the underlying unchecked array data type.

What's good about the representation is that you can play even more representational tricks, like having N references to the same storage, all representing different strings. (This is prone to errors if the storage is mutated; I don't know how you can possibly accuse null terminated strings of only being useful and safe when they are compile-time immutable, while in the same posting you propose a hack that is more similar than it is different.)

"Fat reference strings" are subject to buffer overruns (length field being wrong), aliasing problems (deleting a character from a string but not decrementing the length in all of the references that exist), dangling pointer problems (references to storage that has been deleted) and so on. (Note by the way that null terminated strings at least have the property that if we have multiple pointers aimed at the same string, and we edit that string in place, all the pointers see the new string, and not some half-baked string. We can even insert characters, if the underlying buffer has enough slack.)

It is not reasonable to believe that programmers who make mistakes when programming with null-terminated strings will suddenly write correct code when using fat-referenced strings.

If you use a language like C++ instead of C, and represent these fat references as smart pointer classes, then you can achieve a lot of safety. Basically C++'s std::basic_string template can already be implemented this way. But then you're tied to a particular memory management scheme and programming language. Also, safely and correctly implementing those tricks whereby one string's storage is displaced into another one requires a lot more baggage in the smart pointer. std::basic_string implementations typically use reference counting to manage the lifetime of the underlying array, and so the underlying array needs a refcount. And so it goes.

In any case, null-terminated strings meet requirements for which such managed strings are poorly suited.
