Hacker News new | ask | show | jobs
by bazizbaziz 3155 days ago
FWIW it should be possible to vectorize the search across the row store. With 24 byte tuples (assuming no inter-row padding) you can fit 2.6 rows into a 64 byte cache line (a 512 bit simd register). Then it's just a matter of proper bit masking and comparison. Easier said than done, I figure because that remainder is going to be a pain. Another approach is to use gather instructions to load the first column of each row and effectively pack the first column into a register as if it were loaded from a row store and then do a vectorized compare as in the column-store case.

All of that to underscore it's not that one format vectorizes and the other doesn't. The key takeaway here is that with the column store, the compiler can automatically vectorize. This is especially a bonus for JVM based languages because afaik there is no decent way to hand-roll SIMD code without crossing a JNI boundary.

2 comments

This isn't that hard. A sane person would do 3 cache lines at once as 192 bytes = 8 24 byte tuples. You would do 3 AVX512 loads, a masked compare at the proper places (actually, I think the masked compare instructions can take a memory operand, which might get you uop fusion so the load+compare is one instruction) yielding 3 masks each with 16 bits (of course, most of the bits would be junk). The 16 bit masks can be shift+or'ed together (whether as "k-regs" or in GPRs) and the correct bits can be extracted with PEXT.

The downside of this is that you are still reading 6 times as much data. A straightforward implementation of this should not be CPU bound IMO. If a Skylake Server can't keep up with memory doing 32-bit compares I'll eat my hat.

Gather is not a good idea for this purpose. Gather is very expensive. It's really mainly good for eliminating pointer chasing and keeping a SIMD algorithm in the SIMD domain.

Possible, yes. Performant or pleasant? Maybe not :)