Hacker News new | ask | show | jobs
by jmillikin 750 days ago
The plot is the time to encode/decode a list of 10,000 random integers in a uniform distribution. The throughput is 2-10 GiB/s[0] on my desktop, with portable scalar operations (no SIMD). Distributions skewed toward small values are slightly faster[1].

I believe a decoder than used SIMD would perform even faster since the format is friendly to the BMI instructions found on newer CPUs. You could also store the data as concat(prefix_bytes) + concat(payloads), which would really get things going.

For scientific data with lots of multi-byte f64 values, the single branch for < 0xF0 might be unnecessary and undesirable. There are other formats that can be decoded by completely branchless SIMD, at the cost of larger sizes for very small values. For example, Google has a format that uses the first byte as a bitmask: https://static.googleusercontent.com/media/research.google.c...

[0] vu128 is less affected by value size than VLQ/LEB128, so on fast hardware it decodes 10,000 8-bit values in about the same time as 10,000 64-bit values.

[1] In the discussion on lobste.rs[2] a user wondered what the results would look like with a distribution containing lots of small values. A zipf distribution's results looks like <https://i.imgur.com/WGFtz6f.png>.

[2] https://lobste.rs/s/qvoe7a/vu128_efficient_variable_length#c...

1 comments

Thanks for the reply and clarification! That’s an impressive throughput. Am I interpreting the behavior correctly to think that: your code has one branch per int and that branch will be predicted well when your integers are all of similar magnitude?

I never fully groked the Zipf distribution but am I correct to conclude the worst case for vu128 would be alternating integers on either side of the branch condition?

Thanks for the link to the slides, also clever!

  > Am I interpreting the behavior correctly to think that: your code has
  > one branch per int and that branch will be predicted well when your
  > integers are all of similar magnitude?
Yep, that's generally correct. The branch prediction seems to be harmless either way -- interleaving values on either side of the branch doesn't seem to affect throughput much on my machine (results may vary).

  > I never fully groked the Zipf distribution but am I correct to conclude
  > the worst case for vu128 would be alternating integers on either side of
  > the branch condition?
It depends on whether you're measuring throughput by values/sec or bytes/sec.

In terms of bytes/sec, the worst case is a set of integers exclusively in the range [0xF0, 0xFF] -- big enough to skip the fast path, small enough that each byte has to go through the full bit-masking.

In terms of values/sec, the worst is really big integers because writing 8 bytes (or 16, for `encode_u128`) to a buffer takes longer than writing one byte.

However, regardless of measurement, the worst case for vu128 is faster than VLQ/LEB128 -- except for distributions containing only [0x00, 0x7F] in which case they're all equivalent.

---

Also, after some comments on Lobsters about cases where space efficiency in the u16 range really matters, I'm experimenting with a version that tries to match or exceed VLQ/LEB128 for size efficiency at all bit lengths.

It has three more branches, but in practice this hasn't turned out to matter much -- it seems that indefinite loops are slower than a fixed set of branches, so still significantly faster than VLQ/LEB128 on most sets of integers.

  // like LEB128 but continuation bits are packed into the
  // leading byte -- sometimes called "prefixed varint"
  
  0xxxxxxx                            //  7 bits of payload
  10xxxxxx xxxxxxxx                   // 14 bits of payload
  110xxxxx xxxxxxxx xxxxxxxx          // 21 bits payload
  1110xxxx xxxxxxxx xxxxxxxx xxxxxxxx // 28 bits payload

  // For payloads > 28 bits, use (0xF0 | length) type prefix,
  // same as original post.
  1111____  xxxxxxxx xxxxxxxx [...]
It seems to be more fussy about compiler optimizations, though: https://github.com/rust-lang/rust/issues/125543