| Some questions/remarks: - in the initial `Contains` code snippet (and all early-break variants), most performance is probably lost by branch misses on which element returns true. Probably much better is something like `table[0] == fp | table[1] == fp | table[2] == fp | table[3] == fp` (which might actually get optimized to int/SIMD anyway; who knows). - What is shown in the table with benchmarks? I assume 'operations' is the number of lookups, but how large is the array? Memory latency is probably one of the big effects on overall runtime?
- It's probably more useful to show time per lookup than mean 'total time'. - Also, how do you run the benchmarks? If you make a fixed array of queries and query those many times, the same cachelines will be hit and you won't measure memory latency. Also, in this case the branchpredictor might learn all branches, so best is to only do each query once in your benchmark. - You don't really comment on the differences between benches with different number of 'operations'. Are there any takeaways from this? (otherwise just don't show them?) - Maybe you could try with some `u8x8` SIMD instructions as well? Although the bitmasking is also cute as-is :) |
2 and 3. The "operations" are both the array's size (actually the nearest power of 2) and lookup operations. I first thought that latency should be equal for both cases, because my benchmarks may be too naive - they test lookup by sequential memory access, i.e., iterating buckets from 0 to n, and iterating from first to last element in each bucket. I'll try to experiment with a more shuffled workload in my free time. Thank you for the question. I've not thought about it deeply. Also, counting mispredictions is a good idea; I'll add this metric to my benchmarks to get a complete picture.
4. I didn't pay much attention to that because the trend was still the same, and even the ratio was similar for different counts of lookups per underlying memory layout. There was just some minor deviation, but I ignored it. Maybe I'll run more granular tests with hardware counters to catch some insights.
5. SIMD, in my opinion, would be an overhead there because for a single lookup in my configuration, I'm using a single 32-bit value, and Cuckoo Filter will touch at max 64 bits (2 lookups). However, I was planning to experiment with batching of lookups; I just haven't reached it yet.