| I often benchmark my DIY data structures, as well as check the assembly on godbolt. I think I wrote the code using ML model (they save lot of time), and then fixed the issues and extended it myself. I have the code here: https://pastebin.com/GzY4HMsZ I do the search multiple (thousands) times and calculate total time spent, so the timer is called only 2 times. I search across the same array, but for the different number each time. So the array should be completely within L1 cache. It's also important to use the result of the search otherwise the compiler simply omits the code. I use clock_gettime() which takes on order of 1us. Timestamp counter (which is faster) is not available for me because every CPU core has its own counter and they show different values, so Linux refuses to use it. First core is 600 ms ahead of other cores. How in 2025 we cannot make a boring counter amazes me. Or, maybe this is intentional to prevent using cheaper consumer hardware for professional purposes. Here are some numbers: Array size 8, 200 000 iterations, time ns/seach: linear 8 ns, linear no-branch <1 ns (I wonder if it is an error), binary 11 ns, binary no-branch 7 ns. Array size 128, 100 000 iterations, time: linear 34 ns, linear no-branch 22 ns, binary 35 ns, binary no-branch 16 ns. As you can see, the branches is what ruins the performance here. Every time the CPU mis-predicts a branch, it wastes several cycles. Looking through disassembly on godbolt, the compiler inlined and vectorized the measurement loop for no-branch code, which might explain the numbers. The SIMD assembly is hard to read and I don't quite understand what it does: https://godbolt.org/z/dofKnT3W3 I'll be happy to read if you point me at any mistakes in my benchmark. If you want to compile the code, I used gcc with "-O3" and maybe with "-march=native". |
You could circumvent the counter issue by pinning to a specific core, otherwise you'd have to use scheduler events (ftrace/perf on Linux) to know which CPU you were running on at specific times and when so you could subtract the right counters (I've had to do this before and it isn't pretty). It would also prevent issues where your task gets preempted and moved to another core and you have to warm the caches again. If you use Linux, options like isolcpus and nohz_full ensure you're not measuring the scheduler/other tasks but have to be set at boot. But if your loops between clock_gettime take substantially more than clock_gettime itself then at least the timer overhead is probably not that important.
As for the branch prediction, I would think most of the branches would be predicted correctly because usually you'll have runs of "not x" before each "is x". Switching between many loops of course hurts the prediction.
My SIMD knowledge only extends as far as which compiler options make my code faster.
No shame using LLMs, I use them extensively, but I find I have to write some code yourself because I make noticeably more mistakes coding if I have let the LLM do everything for a couple of weeks.