

by KMag 3795 days ago

About the binary encoding... It's a bit easy to armchair these things, and it's too late for WebAsm now... but if you're on the V8 team, you have access to Google's PrefixVarint implementation (originally by Doug Rhode, IIRC from my time as a Google engineer). A 128-bit prefix varint is exactly as big as an LEB128 int in all cases, but is dramatically faster to decode and encode. It's closely related to the encoding used by UTF-8. Doug benchmarked PrefixVarints and found both Protocol Buffer encoding and Protocol Buffer decoding would be significantly faster if they had thought of using a UTF-8-like encoding.

LEB128 requires a mask operation and a branch operation on every single byte, maybe skipping the final byte, so the decoding cost is one mask and one branch per encoded byte. Using 32-bit or 64-bit native loads gets tricky, and I suspect all of the bit twiddling necessary makes it slower than the naive byte-at-a-time mask-and-branch.

    7 bits -> 0xxxxxxx
    14 bits -> 1xxxxxxx 0xxxxxxx
    ...
    35 bits -> 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 0xxxxxxx
    ...
    128 bits -> 1xxxxxxx 1xxxxxxx 1xxxxxxx ... xxxxxxxx
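
For reference, the byte-at-a-time mask-and-branch decoder described above might look roughly like this. It is a minimal sketch, assuming unsigned values that fit in 64 bits; the name `leb128_decode_u64` is made up for illustration, and it does not reject overlong encodings.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes one unsigned LEB128 value from buf[0..len).
   Returns the number of bytes consumed, or 0 if the input is truncated
   or longer than 10 bytes. */
size_t leb128_decode_u64(const uint8_t *buf, size_t len, uint64_t *out) {
  uint64_t value = 0;
  unsigned shift = 0;
  for (size_t i = 0; i < len && shift < 64; ++i, shift += 7) {
    value |= (uint64_t)(buf[i] & 0x7F) << shift;  /* mask on every byte */
    if ((buf[i] & 0x80) == 0) {                   /* branch on every byte */
      *out = value;
      return i + 1;
    }
  }
  return 0;
}
```

The loop body is where the per-byte mask and per-byte branch come from: nothing about the total length is known until the last byte has been inspected.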

Prefix varints just shift that unary encoding to the front, so you have at most 2 single-byte switch statements, for less branch misprediction, and for larger sizes it's trivial to make use of the processor's native 32-bit and 64-bit load instructions (assuming a processor that supports unaligned loads).

    7 bits -> 0xxxxxxx
    14 bits -> 10xxxxxx xxxxxxxx
    ...
    35 bits -> 11110xxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx
    ...
    128 bits -> 11111111 11111111 xxxxxxxx xxxxxxxx ... xxxxxxxx

There's literally no advantage to LEB128, other than more people have heard about it. A PrefixVarInt 128 is literally always the same number of bytes, it just puts the length-encoding bits all together so you can more easily branch on them, and doesn't make them get in the way of native loads for your data bits.

Also, zigzag encoding and decoding is faster than sign extension, for variable-length integers. Protocol Buffers got that part right.
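
For anyone who hasn't seen zigzag, here is a minimal sketch for 64-bit values. The function names are made up for illustration; the mapping itself (0, -1, 1, -2, ... to 0, 1, 2, 3, ...) is the one Protocol Buffers uses for its sint types.

```c
#include <stdint.h>

/* Hypothetical helper: maps small-magnitude signed values to small unsigned ones. */
static inline uint64_t zigzag_encode(int64_t n) {
  uint64_t u = (uint64_t)n;
  /* (0 - sign bit) is all ones for negative n, all zeros otherwise. */
  return (u << 1) ^ (0 - (u >> 63));
}

/* Hypothetical helper: inverse of zigzag_encode.
   Assumes the usual two's-complement conversion back to int64_t. */
static inline int64_t zigzag_decode(uint64_t z) {
  return (int64_t)((z >> 1) ^ (0 - (z & 1)));
}
```

The result is then fed to whatever unsigned varint encoder you use, so small negative numbers stay short without any sign extension at decode time.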

Note that for security reasons, if there are no non-canonical representations, there can't be security bugs due to developers forgetting to check non-canonical representations. For this reason, you may want to use a bijective base 256[0] encoding, so that there aren't multiple encodings for a single integer. In the UTF-8 world, there have been several security issues due to UTF-8 decoders not properly checking for non-canonical encodings and programmers doing slightly silly checks against constant byte arrays. A bijective base 256 saves you less than half a percent in space usage, but the cost is only one subtraction at encoding time and one addition at decoding time.

[0]https://en.wikipedia.org/wiki/Bijective_numeration

4 comments

s3th 3795 days ago

It's not too late! The wasm binary encoding is open to change up until the browsers ship a stable MVP implementation (then the plan is to freeze the encoding indefinitely at version 1).

The primary advantage of LEB128 is (as you mentioned) that it's a relatively common encoding. PrefixVarint is not an open source encoding IIUC.

We'll do some experiments in terms of speed. If the gains are significant we may be able to adopt something similar (this [0] looks like a related idea).

Thanks for the suggestion.

[0]: http://www.dlugosz.com/ZIP2/VLI.html


KMag 3795 days ago

PrefixVarint isn't open-source, but the encoding is trivial.

PrefixVarints are a folk theorem of Computer Science, (re-)invented in many times and places.

I actually coded it up once in Python and once in C before joining Google, and was chatting with an engineer, complaining about the Protocol Buffer varint encoding. The person I was complaining to, said "Yea, Doug Rhode did exactly that, called it PrefixVarint. He benchmarked it much faster."


KMag 3794 days ago

See my other comments on this thread for a simple implementation of a bijective big-endian prefix varint encoder. You may or may not want a bijective encoding, and probably want little-endian. I'm just used to writing big-endian encoders (for lexicographical sorting reasons), so that was faster for me to whip up a demonstration of a bijective encoder.

A real implementation would use a switch statement instead of a loop. One might use a lookup table or a few instructions of inline assembly to calculate the number of leading ones in the first byte, and switch on that.
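
As a rough sketch of the "count the leading ones, then switch" idea: the helper below is hypothetical, assumes a 9-byte maximum (64-bit values) with the unary length in the high bits of the first byte, and relies on the GCC/Clang builtin `__builtin_clz` with a 32-bit `unsigned int`.

```c
#include <stdint.h>

/* Hypothetical helper: total encoded length (1..9) given the first byte. */
static inline int prefix_varint_length(uint8_t first) {
  if (first == 0xFF) {
    return 9;  /* all eight bits are length bits; also avoids clz(0) */
  }
  /* Invert so leading ones become leading zeros, then move the byte to the
     top of a 32-bit word so __builtin_clz counts only those bits. */
  uint32_t inverted = (uint32_t)(uint8_t)~first << 24;
  return __builtin_clz(inverted) + 1;
}
```

A decoder would then `switch` on the returned length, with each case doing one native load and a fixed shift and mask.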


haberman 3795 days ago

I have been advocating for the PrefixVarint encoding you mention for a while.

One thing I'd mention though: as you've specified it here, it puts the continuation bits as the high bits of the first byte. I think it may be better to put them in the lower bits of that byte instead. It would allow for a simple loop-based implementation of the encoder/decoder (LEB128 also allows this). With continuation bits in the high bits of the first byte, you pretty much have to unroll everything. You have to give each length its own individual code-path, with hard-coded constants for the shifts and continuation bits.

The downside is one extra shift of latency in the one-byte case, imposed on all encoders/decoders.

Unrolling is probably a good idea for optimization anyway, but it seems better to standardize on something that at least allows a simple implementation.

Here is some sample code for a loop-based implementation that uses low bits for continuation bits:

    // Little-endian only. Untested.
    #include <stdint.h>
    #include <string.h>

    char *encode(char *p, uint64_t val) {
      int len = 1;
      uint64_t encoded = val << 1;
      // Smallest value that no longer fits in len bytes (7 value bits per byte).
      uint64_t max = 1ULL << 7;
      while (val >= max) {
        if (max == 1ULL << 56) {
          // Special case so 64 bits fits in 9 bytes.
          *p++ = (char)0xff;
          memcpy(p, &val, 8);
          return p + 8;
        }
        encoded = (encoded << 1) | 1;
        max <<= 7;
        len++;
      }
      memcpy(p, &encoded, len);
      return p + len;
    }

    const char *decode(const char *p, uint64_t *val) {
      // Read the first byte as unsigned so the 0xff comparison works
      // even where plain char is signed.
      unsigned char b = (unsigned char)*p;
      if (b == 0xff) {
        // 9-byte special case
        memcpy(val, p + 1, 8);
        return p + 9;
      }

      // Can optimize with something like
      //   int len = __builtin_ctz(~b) + 1;
      int len = 1;
      while (b & 1) {
        len++;
        b >>= 1;
      }

      *val = 0;
      memcpy(val, p, len);
      *val >>= len;
      return p + len;
    }


KMag 3795 days ago

You can have an equally simple implementation (plus one mask operation) if you put the length encoding in the most significant bits. The advantage of having the length in the most significant bits is that in the common case (1-byte integers), the decoding is faster.


haberman 3795 days ago

Are you sure? It does not seem like it will be as simple. When continuation bits are at the top of the first byte, they come between the value bits in the first byte and value bits in the subsequent bytes. This means you have to manipulate them independently, instead of being able to manipulate them as an atomic group. With low continuation bits, all the value bits get to stay together.

If it would be as simple, you should be able to easily modify my sample encoder/decoder above to illustrate.


KMag 3794 days ago

Oops. You're right for little-endian encoders. (See another comment on this thread for a simple bijective big-endian encoder I whipped up just now.) I've always written big-endian encoders or bijective big-endian encoders, so that byte strings sort the same lexicographically and numerically.

Though, a simple loop encoder and decoder are still easily doable if the unary length encoding is in the most significant bits. You're right, though, for a little-endian encoder, it's slightly simpler to put the unary length encoding in the least significant bits.


TD-Linux 3795 days ago

I think this is definitely an improvement over the wasm varint implementation. However, wasm bytecode is almost always going to be delivered compressed with gzip or brotli, so measurements of compression and speed should be taken after those. In particular, I'm wondering if a plain non-variable integer encoding would be best, considering how brotli and gzip operate on byte sequences.


Joky 3795 days ago

This is definitely something I'd really like to see benchmarked: how valuable is it to stack two different "compression" schemes, compared to using only the "complicated" one (gzip or brotli)?


acqq 3795 days ago

Can you please explain how you'd use "bijective numeration" specifically? What do you think has to be changed or added to your proposal:

    7 bits -> 0xxxxxxx
    14 bits -> 10xxxxxx xxxxxxxx
    ...


KMag 3794 days ago

A real implementation would probably be switch-driven, but I whipped up a terse implementation for a big-endian bijective encoder to go with my other comment (tested, but test code omitted):

    int varint_u64_encode(uint8_t** start, const uint8_t* limit, uint64_t value) {
      uint64_t offset;
      uint8_t* position;
      int bytes;

      for (bytes = 1, offset = 0x80;  value >= offset && bytes < 9; ++bytes) {
        offset = (offset << 7 ) | 0x80;
      }
      position = *start;
      value -= ( offset >> 7 ) ^ 1;  /* subtract the smallest value of this length */
      if (limit - position < bytes) {return 0; /* not enough space */}
      /* Unary length prefix. A 9-byte encoding has no value bits in its first
         byte, and shifting a 64-bit value by 64 would be undefined behavior. */
      *position = (uint8_t)(0xFF << (9-bytes));
      if (bytes < 9) { *position |= (uint8_t)(value >> ((bytes-1) * 8)); }
      for (++position; --bytes > 0; ++position) {
        *position = (uint8_t) (value >> ((bytes-1) * 8));
      }
      *start = position;
      return 1;
    }

    int varint_u64_decode(const uint8_t** start, const uint8_t* limit, uint64_t* result) {
      uint64_t value;
      uint64_t offset;
      const uint8_t* position;
      int bytes;
      int length;
      uint8_t mask;

      position = *start;  offset = 0;
      if (position >= limit) { return 0; /* nothing to read */ }
      for(bytes=1, mask = 0x80; (mask & *position) == mask && bytes < 9; ++bytes) {
        mask = 0x80 | (mask >> 1);
        offset = (offset << 7) | 0x80;
      }
      if (limit - position < bytes) { return 0; /* not enough space */}
      length = bytes;  /* the loop below counts bytes down to zero */
      value = (~mask) & *position;
      ++ position;
      for(; --bytes > 0; ++position) {
        value = (value << 8) + *position;
      }
      value += offset;
      /* In the 9-byte case the addition can wrap around; a wrapped sum is < offset. */
      if (length == 9 && value < offset) {return 0; /* overflow */ }
      *result = value;
      *start = position;
      return 1;
    }


KMag 3794 days ago

For a bijective base:

    7 bits -> Decode 0xxxxxxx and add 0 (unchanged)
    14 bits -> Decode 10xxxxxx xxxxxxxx and add 0x80
    21 bits -> Decode 110xxxxx xxxxxxxx xxxxxxxx and add 0x4080

In a non-bijective base vs. a bijective base:

    7 bits encode 0 to 2^7 - 1 vs. 0 to 2^7 - 1
    14 bits encode 0 to 2^14 - 1 vs. 2^7 to 2^14 + 2^7 - 1
    21 bits encode 0 to 2^21 - 1 vs. 2^14 + 2^7 to 2^21 + 2^14 + 2^7 - 1
    ...

In the bijective decoding routine, you need to special-case the maximum length case to check for numeric overflow.
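
The offsets listed above (0, 0x80, 0x4080, ...) follow a simple pattern: each one is the previous offset shifted left by 7 with 0x80 OR'd in. A small sketch of that, with a made-up helper name and assuming encodings of 1 to 9 bytes:

```c
#include <stdint.h>

/* Hypothetical helper: the amount added when decoding a `bytes`-long
   bijective encoding, i.e. the smallest value that needs that many bytes. */
static uint64_t bijective_offset(int bytes) {
  uint64_t offset = 0;
  for (int i = 1; i < bytes; ++i) {
    offset = (offset << 7) | 0x80;
  }
  return offset;
}
```

For 9 bytes the offset is 2^7 + 2^14 + ... + 2^56, which is why adding it to a full 64-bit payload can overflow and needs the special-case check.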
