Hacker News new | ask | show | jobs
by danking00 750 days ago
If I’m following, this presents an alternative format and implementation to LEB128 which encodes and decodes substantially faster. Notably, the implementation is quite simple. Cool! And agreed that modern CPUs really suffer from branches.

Should I interpret the plot to mean the average elapsed wall clock time per integer decoded/encoded? And can I conclude the throughput is the reciprocal? So about 100,000 integers per second or around a 1 MB/s of decompressed data.

I know this is a bit unfair because the implementation is much more complex, but my first thought is why I would use vu128 instead of Lemire’s Stream VByte: https://arxiv.org/abs/1709.08990

A slight tangent but I stumbled on this library which stores floats XOR’ed with the previous float in the stream: https://github.com/velvia/compressed-vec it seems really clever to me! They reference “Gorilla: A Fast, Scalable, In-Memory Time Series Database” which in turn references two 2006 papers: “Fast Lossless Compression of Scientific Floating-Point Data” and “Fast and Efficient Compression of Floating-Point Data”. Frustratingly, the FB paper doesn’t benchmark their XOR-based floating point encoding but the earlier two papers do.

1 comments

The plot is the time to encode/decode a list of 10,000 random integers in a uniform distribution. The throughput is 2-10 GiB/s[0] on my desktop, with portable scalar operations (no SIMD). Distributions skewed toward small values are slightly faster[1].

I believe a decoder than used SIMD would perform even faster since the format is friendly to the BMI instructions found on newer CPUs. You could also store the data as concat(prefix_bytes) + concat(payloads), which would really get things going.

For scientific data with lots of multi-byte f64 values, the single branch for < 0xF0 might be unnecessary and undesirable. There are other formats that can be decoded by completely branchless SIMD, at the cost of larger sizes for very small values. For example, Google has a format that uses the first byte as a bitmask: https://static.googleusercontent.com/media/research.google.c...

[0] vu128 is less affected by value size than VLQ/LEB128, so on fast hardware it decodes 10,000 8-bit values in about the same time as 10,000 64-bit values.

[1] In the discussion on lobste.rs[2] a user wondered what the results would look like with a distribution containing lots of small values. A zipf distribution's results looks like <https://i.imgur.com/WGFtz6f.png>.

[2] https://lobste.rs/s/qvoe7a/vu128_efficient_variable_length#c...

Thanks for the reply and clarification! That’s an impressive throughput. Am I interpreting the behavior correctly to think that: your code has one branch per int and that branch will be predicted well when your integers are all of similar magnitude?

I never fully groked the Zipf distribution but am I correct to conclude the worst case for vu128 would be alternating integers on either side of the branch condition?

Thanks for the link to the slides, also clever!

  > Am I interpreting the behavior correctly to think that: your code has
  > one branch per int and that branch will be predicted well when your
  > integers are all of similar magnitude?
Yep, that's generally correct. The branch prediction seems to be harmless either way -- interleaving values on either side of the branch doesn't seem to affect throughput much on my machine (results may vary).

  > I never fully groked the Zipf distribution but am I correct to conclude
  > the worst case for vu128 would be alternating integers on either side of
  > the branch condition?
It depends on whether you're measuring throughput by values/sec or bytes/sec.

In terms of bytes/sec, the worst case is a set of integers exclusively in the range [0xF0, 0xFF] -- big enough to skip the fast path, small enough that each byte has to go through the full bit-masking.

In terms of values/sec, the worst is really big integers because writing 8 bytes (or 16, for `encode_u128`) to a buffer takes longer than writing one byte.

However, regardless of measurement, the worst case for vu128 is faster than VLQ/LEB128 -- except for distributions containing only [0x00, 0x7F] in which case they're all equivalent.

---

Also, after some comments on Lobsters about cases where space efficiency in the u16 range really matters, I'm experimenting with a version that tries to match or exceed VLQ/LEB128 for size efficiency at all bit lengths.

It has three more branches, but in practice this hasn't turned out to matter much -- it seems that indefinite loops are slower than a fixed set of branches, so still significantly faster than VLQ/LEB128 on most sets of integers.

  // like LEB128 but continuation bits are packed into the
  // leading byte -- sometimes called "prefixed varint"
  
  0xxxxxxx                            //  7 bits of payload
  10xxxxxx xxxxxxxx                   // 14 bits of payload
  110xxxxx xxxxxxxx xxxxxxxx          // 21 bits payload
  1110xxxx xxxxxxxx xxxxxxxx xxxxxxxx // 28 bits payload

  // For payloads > 28 bits, use (0xF0 | length) type prefix,
  // same as original post.
  1111____  xxxxxxxx xxxxxxxx [...]
It seems to be more fussy about compiler optimizations, though: https://github.com/rust-lang/rust/issues/125543