|
|
|
|
|
by ant6n
528 days ago
|
|
You kind of lost me towards the end, so I’m not sure whether you attempted it, but I was wondering whether it would be possible for the internal nodes to store only the upper 8/16 bits (that are nonzero for the current subtree). This implies that one 64 byte cache line stores 32 or 64 entries instead of 16 (or better: 31 or 62/63, since u may need some bookkeeping dat). The next level would be to keep track of how many prefix bits are implied by the parents, so leaf nodes could perhaps also only use 8/12/16 bits, if the the higher bits are implied by the parent. or instead of bit masks, use offsets (i.e. leaves store k bits offset from parent value). That may screw around with the branch predictor, and may work very good with evenly distributed data vs uneven data. |
|
On the other hand, most of the latency is in the last few layers, and probably there isn't as much to be saved there.
The biggest problem might be that the bottom layer will anyway have to store full-width numbers, since we must be sure to have the low-order bits somewhere in case they weren't covered yet in earlier layers. Or we could have variable width encoding per node maybe (instead of per layer) but that does sound a bit iffy on the branch predictor.
In the end I guess I kinda like the simplicity and hence reliability of the 'just do the full tree and the first layers are cheap anyway' approach. Probably another factor 2 speedup is possible on specific datasets, but anything beyond this may not be reliably good on worst case inputs (like is the issue for the prefix partitioning methods).