Hacker News new | ask | show | jobs
by Galanwe 244 days ago
While the results look impressive, I can't help but think "yeah but had you stored an absolute value every X deltas instead of just a stream of deltas, you would have had a perfectly scalable parallel decoding"
3 comments

I just did a mini-ablation study for this (prefix sum). By getting rid of the cross-block carry (16 values), you can increase perf from 19.85 to 23.45 GB/s: the gain is modest as most performance is lost on accumulator carry within the block.

An absolute value every 16 deltas would undermine compression: a greater interval would lose even the modest performance gain, while a smaller interval would completely lose the compressibility benefits of delta coding.

It's a different matter, although there is definitely plausible motivation for absolute values every X deltas: query/update locality (mini-partition-level). You wouldn't want to transcode a huge number of values to access/modify a small subset.

I think what GP is saying is you can drop an absolute value every e.g. 8192 elements (or even orders of magnitude more if you're actually processing GBs of element) and this frees you to compute the blocks in parallel threads in a dependency free manor. The latency for a block would still bound by the single core rate, but the throughput of a stream is likely memory bound after 2-3 cores. It still hurts the point of doing delta compression, but not nearly as bad as every 16 values would.

Even if one is willing to adopt such an encoding scheme, you'd still want to optimize what you have here anyways though. It also doesn't help, as mentioned, if the goal is actually latency of small streams rather than throughput of large ones.

Oh right. That's sensible enough. Makes total sense to parallelise across multiple cores.

I wouldn't expect a strictly linear speed-up due to contention on the memory bus, but it's not as bad as flat-lining after engaging 2-3 cores. On most AWS Graviton instances you should be able to pull ~5.5 GB/s per-core even with all cores active, and that becomes less of a bottleneck when considering you'll typically run a sequence of compute operations between memory round-trips (not just delta).

fgiesen has a 3 part series of blog posts from 2018, "reading bits in far too many ways", focusing on the problem of reading data encoded a variable-bit-length code

Part 3 explores a straightforward idea that can be used to improve performance of bitstream decoding - decode multiple streams at once.

> if decoding from a single bitstream is too serial, then why not decode from multiple bitstreams at once?

> So how many streams should you use? It depends. At this point, for anything that is even remotely performance-sensitive, I would recommend trying at least N=2 streams. Even if your decoder has a lot of other stuff going on (computations with the decoded values etc.), bitstream decoding tends to be serial enough that there’s many wasted cycles otherwise, even on something relatively narrow like a dual-issue in-order machine.

> If you’re using multiple streams, you need to decide how these multiple streams get assembled into the final output bitstream. If you don’t have any particular reason to go with a fine-grained interleaving, the easiest and most straightforward option is to concatenate the sub-streams

For delta coding, perhaps instead of resetting the delta encoding stream with an absolute value every X deltas, for a small value of X, this suggests we could use a much larger value of X, i.e. partition the input into coarse chunks and delta encode each, and then rewrite our decoder to decode a pair (or more) of the resulting streams at once.

Unclear if that would help at all here, or if there's already enough independent work for the CPU to keep busy with.

Part 1 https://fgiesen.wordpress.com/2018/02/19/reading-bits-in-far...

Part 2 https://fgiesen.wordpress.com/2018/02/20/reading-bits-in-far...

Part 3 https://fgiesen.wordpress.com/2018/09/27/reading-bits-in-far...

While that sounds like a dealbreaker, I can't help think "yeah but if a decoding method took advantage of prefix in a similarly scalable way, one would reap the same benefits".