by WaitWaitWha 165 days ago
> To solve these problems, Umbra, the research predecessor of CedarDB, invented what Andy Pavlo now affectionately (we assume ;)) calls “German-style strings”.

This is how Borland Turbo Pascal stored strings as far back as its first version in the mid-80s.

Length followed by the string.

4 comments

I think it is about the kind of union they use to store the string differently depending on its length, not the fact of length+data. Anyway, the idea is/was also nothing remotely new, as many Lisp and Scheme implementations have done so for strings and numbers basically for ages.
German-style strings are a way to store an array of strings for columnar DBs. The idea is to have an array of metadata. Each metadata entry has a fixed size (16 bytes) and includes the string length and either a pointer plus a string prefix or, for short strings, the full string. For some operations the string prefix is enough, which in many cases avoids the indirection.

This is different from Pascal strings.
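
To make the layout concrete, here is a minimal sketch of such a 16-byte slot. The exact field widths (4-byte length, 4-byte prefix, 8-byte pointer, 12 inline bytes) are my assumption, and `heap`, `encode`, `decode`, `prefix` and `definitely_differ` are hypothetical names; an offset into a `bytearray` stands in for a real pointer.

```python
import struct

INLINE_MAX = 12      # assumed: bytes left in a 16-byte slot after a 4-byte length
heap = bytearray()   # hypothetical out-of-line storage for long strings


def encode(s: bytes) -> bytes:
    """Pack a string into a fixed 16-byte metadata slot."""
    if len(s) <= INLINE_MAX:
        # Short string: length + inline data, zero-padded to 12 bytes.
        return struct.pack("<I12s", len(s), s)
    # Long string: length + 4-byte prefix + 8-byte offset (stand-in for a pointer).
    offset = len(heap)
    heap.extend(s)
    return struct.pack("<I4sQ", len(s), s[:4], offset)


def decode(slot: bytes) -> bytes:
    """Recover the full string, following the offset only for long strings."""
    (length,) = struct.unpack_from("<I", slot)
    if length <= INLINE_MAX:
        return slot[4:4 + length]
    (offset,) = struct.unpack_from("<Q", slot, 8)
    return bytes(heap[offset:offset + length])


def prefix(slot: bytes) -> bytes:
    """First 4 bytes of the string; sits at the same position in both layouts."""
    return slot[4:8]


def definitely_differ(a: bytes, b: bytes) -> bool:
    """True if length or prefix differ, decided without touching the heap."""
    return a[:8] != b[:8]
```

When `definitely_differ` returns False the strings may still differ further on, so a full comparison is needed only in that case. That is the "prefix is enough in many cases" part.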

Storing the prefix and the tagged union of pointer and inline data is a big difference from Pascal strings, though.
That's not what it's doing though.

Pascal strings are: { length, pointer }

In these strings:

For short strings it's storing:

  { length, string value }

For longer strings, it's storing:

  { length, prefix, class, pointer }
> Pascal strings are: { length, pointer }

The historical P-strings are just a pointer, with the length at the head of the buffer. Hence the name length-prefixed strings, and their limitation to 255 bytes (only one byte was reserved for the length; you can still see this in the most basic string type of Free Pascal: https://www.freepascal.org/docs-html/ref/refsu9.html).

    {length, pointer}
or

    {length, capacity, pointer}
are struct / record strings, which is what pretty much every modern language uses (possibly with optimisations, e.g. SSO23 is basically a p-string when inline, but can move out of line into a full record string).
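
For comparison, a rough sketch of the historical P-string layout described above: one length byte at the head of the buffer, followed by the data. The helper names `to_pstring` and `from_pstring` are made up for illustration.

```python
def to_pstring(s: bytes) -> bytes:
    """Build a length-prefixed buffer: one length byte, then the data."""
    if len(s) > 255:
        # A single length byte cannot represent more than 255.
        raise ValueError("P-string data is limited to 255 bytes")
    return bytes([len(s)]) + s


def from_pstring(buf: bytes) -> bytes:
    """Read the length from the first byte and slice out that many bytes."""
    n = buf[0]
    return buf[1:1 + n]
```

Here the length travels inside the buffer itself, whereas a record string keeps it next to the pointer.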