| HN Mirror

by derefr 4509 days ago

You know what's nicer than delimiting beginnings and ends of things? Length prefixing. Protocol message formats and data encoding formats both already know what they're going to say before they say it, and so know its octet length.

The only reason to use delimiters, ever, is for user-modifiable data (e.g. source code) where you might want to insert or delete characters and have the containing block remain valid.
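A minimal sketch of what length prefixing can look like on the wire, assuming a made-up framing of a 4-byte big-endian length followed by the payload. The names `frame_write`, `frame_read`, and `FRAME_HEADER_LEN` are hypothetical and not taken from any particular protocol.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical framing: 4-byte big-endian payload length, then the payload. */
#define FRAME_HEADER_LEN 4u

/* Writes one frame into dst. Returns bytes written, or 0 if dst is too small. */
size_t frame_write(uint8_t *dst, size_t dst_cap, const void *payload, uint32_t len)
{
  if (dst_cap < FRAME_HEADER_LEN || dst_cap - FRAME_HEADER_LEN < len)
    return 0;
  dst[0] = (uint8_t)(len >> 24);
  dst[1] = (uint8_t)(len >> 16);
  dst[2] = (uint8_t)(len >> 8);
  dst[3] = (uint8_t)len;
  if (len > 0)
    memcpy(dst + FRAME_HEADER_LEN, payload, len);
  return FRAME_HEADER_LEN + (size_t)len;
}

/* Reads one frame from src. On success, points *payload at the data inside
   src, stores its length in *len, and returns the total bytes consumed.
   Returns 0 if src does not yet hold a complete frame. */
size_t frame_read(const uint8_t *src, size_t src_len,
                  const uint8_t **payload, uint32_t *len)
{
  if (src_len < FRAME_HEADER_LEN)
    return 0;
  uint32_t n = ((uint32_t)src[0] << 24) | ((uint32_t)src[1] << 16) |
               ((uint32_t)src[2] << 8) | (uint32_t)src[3];
  if (src_len - FRAME_HEADER_LEN < n)
    return 0;
  *payload = src + FRAME_HEADER_LEN;
  *len = n;
  return FRAME_HEADER_LEN + (size_t)n;
}
```

The payload can contain any byte value, including \0, and the reader never scans it for a terminator or un-escapes anything.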

---

And now, a fun tangent, to show how deeply rooted this confusion is in CS: user-modifiable data was originally the sole use-case for \0-terminated "C strings" in C.

C has two separate types which get conflated nowadays: char arrays, and \0-terminated strings. Most "strings"--as we'd expect to find them in other languages--were, in C, actually char arrays: you knew their length, either because they were string literals and you could sizeof them, or because you had #defined both FOO and FOO_LEN, or because you had just allocated len bytes on the heap for foo, so you could just pass len along with foo. Because you knew their length, you didn't need to use the string.h functions to manipulate them. It was idiomatic (and perfectly safe) C, when dealing with char arrays, to just iterate through them with a for loop.

The concept of \0-termination, and thus what we think of as "C strings", only applied to string buffers: fixed-size, stack-allocated, uninitialized char arrays. The string.h functions are all meant to be employed to manipulate string buffers, and the \0 is intended to mark where the buffer stops being useful data, and starts being uninitialized garbage.

The strings in string buffers had short lifetimes, and didn't usually outlive the stack frame the buffer was declared in. Generally, you'd declare a string buffer, populate it using some combination of string literals, strcat(3), sprintf(3), and system calls, and then pass the string--still sitting inside the buffer--to a system call like stat(2) to get what you're really after. That would be the end of both the string buffer's and the string's lifetime.
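A rough sketch of that buffer lifecycle, assuming a POSIX system. The function name `path_exists` and the buffer size `PATH_BUF_LEN` are made up for illustration, and snprintf(3) stands in for sprintf(3) so that an overlong path is detected instead of overflowing the buffer.

```c
#include <stdio.h>
#include <sys/stat.h>

enum { PATH_BUF_LEN = 4096 };  /* assumed buffer size for this sketch */

/* Returns 1 if dir/name exists, 0 if it does not or the path does not fit. */
int path_exists(const char *dir, const char *name)
{
  char buf[PATH_BUF_LEN];  /* uninitialized until snprintf writes to it */
  struct stat st;

  /* snprintf writes the \0 that marks where the useful data in buf ends. */
  int n = snprintf(buf, sizeof buf, "%s/%s", dir, name);
  if (n < 0 || (size_t)n >= sizeof buf)
    return 0;  /* encoding error or truncated path */

  /* The string is consumed here; buf and its contents die with this frame. */
  return stat(buf, &st) == 0;
}
```

Nothing outside the function ever sees the \0-terminated string; only the answer leaves the stack frame.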

If you ever did want to preserve the contents of a string buffer into something you could pass around, though, this would be idiomatic:

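    #include <stdlib.h>
    #include <string.h>

    /* Returns the length of the copied path, or -1 if allocation fails. */
    int give_me_a_path_string(char **out)
    {
      char buf[MAX_PATH];

      /* ... fill buf with a \0-terminated path ... */

      size_t len = strlen(buf);
      char *copy = malloc(len ? len : 1);  /* no room for the \0 on purpose */

      if (copy == NULL)
        return -1;
      memcpy(copy, buf, len);
      *out = copy;

      return (int)len;
    }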

Note that, after this function returns, the pointer it has written to doesn't point to a "C string": instead, it's a plain pointer to a heap-allocated array of char, with exactly enough space to hold just those characters. If you want to know how big it is, you look at the return value.

So:

• C has "C strings", but they were only intended as buffers.

• C also has "char arrays", which are really what you should think of as C's equivalent to a "string" datatype. char arrays, not "C strings", are the fundamental data structure for representing and persisting strings in C.

• char arrays are less like "C strings" than they are like Pascal strings: they come in two parts, a block of memory N chars wide, and an int containing N. You don't examine the block to determine the length; the length is explicit.

• Pascal (and thus most modern languages with strings) put both the length and the character-block on the heap as a unit. C puts the character-block on the heap, but puts the length on the stack. This is more efficient under C's Unix-rooted assumptions: you need the length on the stack if you want to work with it to immediately shove the string through a pipe.
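The pointer-plus-length pairing from the list above might look like this in use. This is only a sketch: `count_byte` and `emit` are hypothetical helpers, and fwrite(3) stands in for whatever pipe or descriptor the bytes are really headed to.

```c
#include <stddef.h>
#include <stdio.h>

/* Counts occurrences of c in the len bytes at s. No \0 is needed or expected. */
size_t count_byte(const char *s, size_t len, char c)
{
  size_t hits = 0;
  for (size_t i = 0; i < len; i++)
    if (s[i] == c)
      hits++;
  return hits;
}

/* Writes exactly len bytes to out. Returns the number of bytes written. */
size_t emit(FILE *out, const char *s, size_t len)
{
  return fwrite(s, 1, len, out);
}
```

Because the length travels next to the pointer, neither function has to scan for a terminator, and a \0 in the middle of the data is just another byte.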

1 comments

Pxtl 4509 days ago

The problem: I have never encountered length-prefixed data. Ever. Every data interchange file I've ever dealt with has been either delimited or fixed-width fields (and the widths are not defined anywhere in the file).

derefr 4509 days ago

Examples of length-prefixed data abound in protocols and formats defined by systems and telecom engineers (e.g. the IETF). IP packets are length-prefixed. ELF-binary tables and sections are length-prefixed. PNG chunks are length-prefixed.
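To make the PNG case concrete: each chunk starts with a 4-byte big-endian length that counts only the chunk data, followed by a 4-byte type, then the data, then a 4-byte CRC. The sketch below reads one chunk header from a buffer; the struct and function names are hypothetical, and CRC checking is left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct png_chunk_header {
  uint32_t data_len;  /* length of the chunk data only */
  char type[5];       /* four type bytes plus a \0 for printing */
};

/* Parses the 8-byte chunk header at p. Returns the size of the whole chunk
   (length field + type + data + CRC) so the caller can skip to the next one,
   or 0 if fewer than 8 bytes are available or the length is out of range. */
size_t png_chunk_parse(const uint8_t *p, size_t avail, struct png_chunk_header *h)
{
  if (avail < 8)
    return 0;
  uint32_t n = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
               ((uint32_t)p[2] << 8) | (uint32_t)p[3];
  if (n > 0x7FFFFFFFu)  /* the PNG spec caps chunk length at 2^31 - 1 */
    return 0;
  h->data_len = n;
  memcpy(h->type, p + 4, 4);
  h->type[4] = '\0';
  return (size_t)n + 12;  /* 4 length + 4 type + n data + 4 CRC */
}
```

A reader that does not recognize a chunk type can skip it using the returned size, without understanding the contents at all.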

It's just these worse-is-better text-based protocols like HTTP, created by application developers, that toss all the advantages of length-prefixing away. (And, even then, HTTP bodies are length-prefixed, with the Content-Length header. It's just the headers that aren't.)

mikeash 4508 days ago

The only problem with length prefixing is that it interferes with streaming data, because you need to know the full length in advance. Thus HTTP chunked encoding. Still, it works great in most scenarios.

My favorite way to deal with this stuff is Consistent Overhead Byte Stuffing:

http://en.wikipedia.org/wiki/Consistent_Overhead_Byte_Stuffi...

In short, you take the data and encode it with a clever scheme that effectively escapes all the zero bytes. The output contains no zeroes and costs almost nothing: the worst case is about one extra byte per 254 bytes of input, and the best case is a single extra byte. (Compare to e.g. backslash escapes of quotes in quoted strings, where the worst case doubles the output size.) You then use the now-eliminated zero byte as your record separator. This lets you stream data (with a small amount of buffering to perform the encoding) while still easily locating the ends of chunks.
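For the curious, here is a compact sketch of COBS encoding and decoding. The function names and signatures are my own, not from any library, and the caller is assumed to append the 0x00 record separator after each encoded record.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes len bytes from src into dst, which must have room for
   len + len/254 + 1 bytes. Returns the encoded length. The output
   contains no zero bytes. */
size_t cobs_encode(const uint8_t *src, size_t len, uint8_t *dst)
{
  size_t code_idx = 0;  /* where the current block's code byte goes */
  size_t out = 1;       /* next free output position */
  uint8_t code = 1;     /* distance from the code byte to the next zero */

  for (size_t i = 0; i < len; i++) {
    if (src[i] != 0) {
      dst[out++] = src[i];
      code++;
    }
    if (src[i] == 0 || code == 0xFF) {  /* zero seen, or block is full */
      dst[code_idx] = code;
      code = 1;
      code_idx = out;
      if (src[i] == 0 || i + 1 < len)
        out++;  /* reserve space for the next block's code byte */
    }
  }
  dst[code_idx] = code;
  return out;
}

/* Decodes len bytes from src into dst (at most len bytes are produced).
   Returns 0 and sets *out_len on success, -1 on malformed input. */
int cobs_decode(const uint8_t *src, size_t len, uint8_t *dst, size_t *out_len)
{
  size_t in = 0, out = 0;

  while (in < len) {
    uint8_t code = src[in++];
    if (code == 0 || (size_t)(code - 1) > len - in)
      return -1;  /* zero in the stream, or block runs past the end */
    for (uint8_t k = 1; k < code; k++) {
      if (src[in] == 0)
        return -1;
      dst[out++] = src[in++];
    }
    /* A full 0xFF block carries no implied zero; neither does the last block. */
    if (code != 0xFF && in < len)
      dst[out++] = 0;
  }
  *out_len = out;
  return 0;
}
```

As a worked case, the four bytes 11 22 00 33 encode to 03 11 22 02 33: each code byte says how far away the next zero was in the original data.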

I've played around with COBS but never used it in a real product, so this is not entirely the voice of experience here. But it is a nifty system.

penguindev 4507 days ago

that is just freaking cool. took me about 4 times to grok it. it sort of reminds me of utf-8, and how you can synchronize that easily.

com2kid 4509 days ago

In contrast, I just got done designing an internal protocol today that has a length prefix.

My team pretty much length prefixes everything. :)
