|
|
|
|
|
by s3th
3748 days ago
|
|
As we get closer to having a WebAssembly demo ready in multiple browsers, the group has added a small little website on GitHub [0] that should provide a better overview of the project than browsing the disparate repos (design, spec, etc.). Since the last time WebAssembly hit HN, we've made a lot of progress designing the binary encoding [1] for WebAssembly. (Disclaimer: I'm on the V8 team.) [0]: http://webassembly.github.io/
[1]: https://github.com/WebAssembly/design/blob/master/BinaryEnco... |
|
LEB128 requires a mask operation and a branch operation on every single byte, maybe skipping the final byte, so 127 mask operations and 127 branches. Using 32-bit or 64-bit native loads gets tricky, and I suspect all of the bit twiddling necessary makes it slower than the naive byte-at-a-time mask-and-branch.
Prefix varints just shift that unary encoding to the front, so you have at most 2 single-byte switch statements, for less branch misprediction, and for larger sizes it's trivial make use of the processor's native 32-bit and 64-bit load instructions (assuming a processor that supports unaligned loads). There's literally no advantage to LEB128, other than more people have heard about it. A PrefixVarInt 128 is literally always the same number of bytes, it just puts the length-encoding bits all together so you can more easily branch on them, and doesn't make them get in the way of native loads for your data bits.Also, zigzag encoding and decoding is faster than sign extension, for variable-length integers. Protocol Buffers got that part right.
Note that for security reasons, if there are no non-canonical representations, there can't be security bugs due to developers forgetting to check non-canonical representations. For this reason, you may want to use a bijective base 256[0] encoding, so that there aren't multiple encodings for a single integer. In the UTF-8 world, there have been several security issues due to UTF-8 decoders not properly checking for non-canonical encodings and programmers doing slightly silly checks against constant byte arrays. A bijective base 256 saves you less than half a percent in space usage, but the cost is only one subtraction at encoding time and one addition at decoding time.
[0]https://en.wikipedia.org/wiki/Bijective_numeration