Hacker News new | ask | show | jobs
by curiouscoding 322 days ago
Some questions/remarks:

- in the initial `Contains` code snippet (and all early-break variants), most performance is probably lost by branch misses on which element returns true. Probably much better is something like `table[0] == fp | table[1] == fp | table[2] == fp | table[3] == fp` (which might actually get optimized to int/SIMD anyway; who knows).

- What is shown in the table with benchmarks? I assume 'operations' is the number of lookups, but how large is the array? Memory latency is probably one of the big effects on overall runtime? - It's probably more useful to show time per lookup than mean 'total time'.

- Also, how do you run the benchmarks? If you make a fixed array of queries and query those many times, the same cachelines will be hit and you won't measure memory latency. Also, in this case the branchpredictor might learn all branches, so best is to only do each query once in your benchmark.

- You don't really comment on the differences between benches with different number of 'operations'. Are there any takeaways from this? (otherwise just don't show them?)

- Maybe you could try with some `u8x8` SIMD instructions as well? Although the bitmasking is also cute as-is :)

2 comments

Hi, thank you for your questions: 1. The "one-line" boolean condition still has jumps in the produced JIT Asm, at least to make a short circuit to exit as soon as possible. I guess the C# compiler is far too conservative for such optimisations.

2 and 3. The "operations" are both the array's size (actually the nearest power of 2) and lookup operations. I first thought that latency should be equal for both cases, because my benchmarks may be too naive - they test lookup by sequential memory access, i.e., iterating buckets from 0 to n, and iterating from first to last element in each bucket. I'll try to experiment with a more shuffled workload in my free time. Thank you for the question. I've not thought about it deeply. Also, counting mispredictions is a good idea; I'll add this metric to my benchmarks to get a complete picture.

4. I didn't pay much attention to that because the trend was still the same, and even the ratio was similar for different counts of lookups per underlying memory layout. There was just some minor deviation, but I ignored it. Maybe I'll run more granular tests with hardware counters to catch some insights.

5. SIMD, in my opinion, would be an overhead there because for a single lookup in my configuration, I'm using a single 32-bit value, and Cuckoo Filter will touch at max 64 bits (2 lookups). However, I was planning to experiment with batching of lookups; I just haven't reached it yet.

> they test lookup by sequential memory access, i.e., iterating buckets from 0 to n, and iterating from first to last element in each bucket

Yeah; that completely eliminates the cache misses / memory latency you'd have in practice. Of course eliminating that bottleneck is useful if you want to purely optimize CPU speed, but probably not quite as representative of real workloads. Also makes sense then that different array sizes give similar results: streaming tends to be fast anyway, regardless of the array size.

The one liner shouldn't have shortcircuiting since it uses the binary or |, not the boolean/logical or||.
Aha, you're right, didn't look carefully
Another remark:

If you benchmark it as something like `for q in queries { contains(q); }`, especially the branchless variants are probably executed in parallel by the CPU, and you are measuring throughput instead of latency. That may or may not be relevant depending on the application.

Hm, that's actually may be true, thank you for the heads-up