Hacker News new | ask | show | jobs
by haberman 3749 days ago
I have been advocating for the PrefixVarint encoding you mention for a while.

One thing I'd mention though: as you've specified it here, it puts the continuation bits as the high bits of the first byte. I think it may be better to put them in the lower bits of that byte instead. It would allow for a simple loop-based implementation of the encoder/decoder (LEB128 also allows this). With continuation bits in the high bits of the first byte, you pretty much have to unroll everything. You have to give each length its own individual code-path, with hard-coded constants for the shifts and continuation bits.

The downside is one extra shift of latency in the one-byte case, imposed on all encoders/decoders.

Unrolling is probably a good idea for optimization anyway, but it seems better to standardize on something that at least allows a simple implementation.

Here is some sample code for a loop-based implementation that uses low bits for continuation bits:

    // Little-endian only. Untested.
    char *encode(char *p, uint64 val) {
      int len = 1;
      uint64 encoded = val << 1;
      uint64 max = 1 << 7;
      while (val > max) {
        if (max == 1ULL << 63) {
          // Special case so 64 bits fits in 9 bytes.
          *p++ = 0xff;
          memcpy(p, &val, 8);
          return p + 8;
        }
        encoded = (encoded << 1) | 1;
        max <<= 7;
        len++;
      }
      memcpy(p, &encoded, len);
      return p + len;
    }

    const char *decode(const char *p, uint64* val) {
      if (*p == 0xff) {
        // 9-byte special case
        memcpy(val, p + 1, 8);
        return p + 9;
      }

      // Can optimize with something like
      //   int len = __builtin_ctz(!*p);
      unsigned char b = *p;
      int len = 1;
      while (b & 1) {
        len++;
        b >>= 1;
      }

      *val = 0;
      memcpy(val, p, len);
      *val >>= len;
      return p + len;
    }
1 comments

You can have an equally simple implementation (plus one mask operation) if you put the length encoding in the most significant bits. The advantage of having length in the most significant bit is that in the common case (1 byte integers), the decoding is faster.
Are you sure? It does not seem like it will be as simple. When continuation bits are at the top of the first byte, they come between the value bits in the first byte and value bits in the subsequent bytes. This means you have to manipulate them independently, instead of being able to manipulate them as an atomic group. With low continuation bits, all the value bits get to stay together.

If it would be as simple, you should be able to easily modify my sample encoder/decoder above to illustrate.

Oops. You're right for little-endian encoders. (See another comment on this thread for a simple bijective big-endian encoder I whipped up just now.) I've always written big-endian encoders or bijective big-endian encoders, so that byte strings sort the same lexographically and numerically.

Though, a simple loop encoder and decoder are still easily doable if the unary length encoding is in the most significant bits. You're right, though, for a little-endian encoder, it's a slightly more simple to put the unary length encoding in the least significant bits.