|
About the binary encoding... It's a bit easy to armchair these things, and it's too late for WebAsm now... but if you're on the V8 team, you have access to Google's PrefixVarint implementation (originally by Doug Rhode, IIRC from my time as a Google engineer). A 128-bit prefix varint is exactly as big as an LEB128 int in all cases, but is dramatically faster to decode and encode. It's closely related to the encoding used by UTF-8. Doug benchmarked PrefixVarints and found both Protocol Buffer encoding and Protocol Buffer decoding would be significantly faster if they had thought of using a UTF-8-like encoding. LEB128 requires a mask operation and a branch operation on every single byte, maybe skipping the final byte, so 127 mask operations and 127 branches. Using 32-bit or 64-bit native loads gets tricky, and I suspect all of the bit twiddling necessary makes it slower than the naive byte-at-a-time mask-and-branch. 7 bits -> 0xxxxxxx
14 bits -> 1xxxxxxx 0xxxxxxx
...
35 bits -> 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 0xxxxxxx
...
128 bits -> 1xxxxxxx 1xxxxxxx 1xxxxxxx ... xxxxxxxx
Prefix varints just shift that unary encoding to the front, so you have at most 2 single-byte switch statements, for less branch misprediction, and for larger sizes it's trivial make use of the processor's native 32-bit and 64-bit load instructions (assuming a processor that supports unaligned loads). 7 bits -> 0xxxxxxx
14 bits -> 10xxxxxx xxxxxxxx
...
35 bits -> 11110xxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx
...
128 bits -> 11111111 11111111 xxxxxxxx xxxxxxxx ... xxxxxxxx
There's literally no advantage to LEB128, other than more people have heard about it. A PrefixVarInt 128 is literally always the same number of bytes, it just puts the length-encoding bits all together so you can more easily branch on them, and doesn't make them get in the way of native loads for your data bits.Also, zigzag encoding and decoding is faster than sign extension, for variable-length integers. Protocol Buffers got that part right. Note that for security reasons, if there are no non-canonical representations, there can't be security bugs due to developers forgetting to check non-canonical representations. For this reason, you may want to use a bijective base 256[0] encoding, so that there aren't multiple encodings for a single integer. In the UTF-8 world, there have been several security issues due to UTF-8 decoders not properly checking for non-canonical encodings and programmers doing slightly silly checks against constant byte arrays. A bijective base 256 saves you less than half a percent in space usage, but the cost is only one subtraction at encoding time and one addition at decoding time. [0]https://en.wikipedia.org/wiki/Bijective_numeration |
The primary advantage of LEB128 is (as you mentioned) that it's a relatively common encoding. PrefixVarint is not an open source encoding IIUC.
We'll do some experiments in terms of speed. If the gains are significant we may be able to adopt something similar (this [0] looks like a related idea).
Thanks for the suggestion.
[0]: http://www.dlugosz.com/ZIP2/VLI.html