Hacker News new | ask | show | jobs
by wolfgangK 527 days ago
Amazingly thorough ! I love how the author leaves no stone unturned. I had no idea you could do the kind of low level efficiency shaving in Rust. I wonder how a C++ implementation with https://github.com/jfalcou/eve would compare.
4 comments

Thanks! It's somewhat tiring to not have loose ends, but I agree it pays off :)

Doing this stuff in Rust is absolutely possible, and I'd do it again since my C++ days are now past me, but the endless transmuting between portable-simd types, plain rust arrays, and intrinsic types is quite annoying.

Also rust is not made for convenient raw pointer arithmetic. There plain C would be much more to the point.

You kind of lost me towards the end, so I’m not sure whether you attempted it, but I was wondering whether it would be possible for the internal nodes to store only the upper 8/16 bits (that are nonzero for the current subtree). This implies that one 64 byte cache line stores 32 or 64 entries instead of 16 (or better: 31 or 62/63, since u may need some bookkeeping dat).

The next level would be to keep track of how many prefix bits are implied by the parents, so leaf nodes could perhaps also only use 8/12/16 bits, if the the higher bits are implied by the parent. or instead of bit masks, use offsets (i.e. leaves store k bits offset from parent value).

That may screw around with the branch predictor, and may work very good with evenly distributed data vs uneven data.

Ohh yes some good ideas there! I've thought about pretty much exactly this at some point but then didn't end up including it. But I do think it's quite promising, and most of the bookkeeping can be made to be branchless.

On the other hand, most of the latency is in the last few layers, and probably there isn't as much to be saved there.

The biggest problem might be that the bottom layer will anyway have to store full-width numbers, since we must be sure to have the low-order bits somewhere in case they weren't covered yet in earlier layers. Or we could have variable width encoding per node maybe (instead of per layer) but that does sound a bit iffy on the branch predictor.

In the end I guess I kinda like the simplicity and hence reliability of the 'just do the full tree and the first layers are cheap anyway' approach. Probably another factor 2 speedup is possible on specific datasets, but anything beyond this may not be reliably good on worst case inputs (like is the issue for the prefix partitioning methods).

The lowest level could be encoded with whatever number of bits is required for the numeric distance between the leafs. With proper book keeping, the full 32 bit values don’t need to be stored. The value for each node should be something like (parent << shift + node), where node has ideally 8 Bits, or maybe 16 bits.

It kind of comes down to how to deal with bad distribution of values, like large differences between adjacent values. One could for example deal with this by deliberately leaving „gaps“ in the tree, like using a 59-ary node on places (make the last values in the array large, so they will never get visited). With dynamic programming, perhaps an optimal distribution of the leaf level could be had - although it requires more bookkeeping bits of one needs to have the index of the found value, not just whether the value is in the tree.

The question could also be whether it could be interesting to store delta valeues, so that 63 8-bit valeues gives a larger numeric range - this means computing some sort of cumulative sum on on the 63 values, not sure whether there is a simd instruction for that.

One more thought: if there is different ways different layers of the tree are stored, but they are always the same for each layer, the code be unrolled, with every layer having different code to deal with it.

Last thought: if the index is not important to know, just whether the value is in the set or not, one could use a bitmap as a data structure. It would require 256MB for the 32bit space, but large spans of zeros imply empty pages.

There are stones still unturned. For typical unicode table lookup's you'd need batches for all characters of a word. It would be much faster to lookup eg 16 chars at once, to optimize cache misses.

But when I tested this even Eytzinger was too slow.

Have you used Eve successfully for anything? It’s very confusing, it took me far too long to uppercase a string with it.
Just played a bit with it. Were you working with ASCII ? This example didn't work for you ? https://github.com/jfalcou/eve/blob/a141ba93048bb2916c2157a9...
Yeah, Compare to highway as well

https://github.com/google/highway

I once searched github for simd libraries, sorted by popularity, and added most popular of them to my list: https://github.com/stars/Bulat-Ziganshin/lists/simd

indeed, highway is the popularity leader, it implements lots of SIMD operations and supports lots of hardware including SVE/RVV with runtime SIMD width, although IMHO it has a bit longer learning curve