Hacker News new | ask | show | jobs
by superjan 1383 days ago
It looks a bit too simplistic: Blocks with up to 4k items are stored as sorted lists. Having to insert in a sorted list up to 4k items seems very expensive to me.
4 comments

CPUs have a strong performance bias toward sequential memory access and there are large threshold effects at work here. The block size used is not arbitrary. Improvements in prefetching and cache line utilization can have such large performance benefits that it more than justifies any apparent increase in computational cost because of how the code is organized to obtain those improvements.

Most developers do not have a good intuition for the efficiency and scalability of sequential brute-force on modern architectures. There is a cross-over threshold but I think many people would be surprised at how high it is in practice.

I would have expected 2048 items to be the optimal cutover point, because with 16-bit entries that would result in 4KB arrays. Those fit exactly into a typical CPU memory page, whereas 8KB requires two. In the worst case that might double the page table overheads, e.g.: for random point reads.

I wonder if it would be worth the trouble to code-gen a bunch of variants with things like 8-bit entries, and benchmark to death to determine the optimal cutover points...

I think if you go down this road, you'll start to see differences depending on the brand and model of CPU you use. Essentially, you're going into the territory of FFTW's "planner" scheme.
The threshold was chosen empirically. Modern processors are very fast when operating on memory sequentially, as the prefetcher achieves the best case scenario. The cost of a cache line miss is so substantial this ends up dominating things until "n" is in the 1000s. A lot of programmers' intuition is off the mark on this point.
My intuition was indeed trained a long time ago. In terms of memory use, the current choice is likely optimal. But I have a hard time believing that doing an or operation on two sets of sorted numbers will be as fast as the same for two bitmaps of 8k. The latter can be trivially simd accellerated, and could be done in-place.

Having said that, I applaud that this type of datastructure is being investigated. An efficient bitset should be present in every collections library.

The full arrays do get expensive, although not too bad. I work at FeatureBase and we have a whole analytics DB built on a roaring variant... for perf reasons it's usually worth it to bias toward the bitmap representation when you get past about 2k set bits, though it does take a bit more space.
The 4k items are only 512 bytes though, so expensive yes but not overly so.
No. The person you are replying to is talking about the sparse leaf case where the contained integers are actually being stored (to be more precise the low bits are being stored) in a sorted structure. The switch to the bitmap occurs when the sparse structure takes the same number of bits to directly encode the sparsely contained items. In this case they use a 2^16 bitmap and encode sparse elements using 16-bits, so it occurs at ((2^16 / 16) == 2^12) elements.
They are two byte per item; it's the bitmaps (>4k entries) that use 1 bit per item.