| HN Mirror

This isn't that hard. A sane person would do 3 cache lines at once as 192 bytes = 8 24 byte tuples. You would do 3 AVX512 loads, a masked compare at the proper places (actually, I think the masked compare instructions can take a memory operand, which might get you uop fusion so the load+compare is one instruction) yielding 3 masks each with 16 bits (of course, most of the bits would be junk). The 16 bit masks can be shift+or'ed together (whether as "k-regs" or in GPRs) and the correct bits can be extracted with PEXT.

The downside of this is that you are still reading 6 times as much data. A straightforward implementation of this should not be CPU bound IMO. If a Skylake Server can't keep up with memory doing 32-bit compares I'll eat my hat.

Gather is not a good idea for this purpose. Gather is very expensive. It's really mainly good for eliminating pointer chasing and keeping a SIMD algorithm in the SIMD domain.