| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by YourDadVPN 55 days ago
	How did you do it? Hopefully you generated thousands of such arrays and measured the cost of searching all of them as a single iteration? Because depending on what you use to measure time, the overhead of reading the timer would likely dwarf the cost of searching such a small array. The best case is a dedicated cycle counter instruction/register (e.g. rdtsc) and even that may cost hundreds of cycles. Cache hits cost less than a cycle so if your timing code gated a small number of searches you essentially didn't measure the search at all. That aside your findings point to the prefetcher having identified your linear searches as sequential access so practically every single access was a cache hit (you effectively measured the latency between a CPU and its L1 cache). If you wanted to test this you could do something like make each element some number of cache lines wide. Stream prefetchers have a maximum stride, so variables more than that many bytes apart won't be prefetched. Google's multichase benchmark uses 256 bytes IIRC.

1 comments

codedokode 54 days ago

I often benchmark my DIY data structures, as well as check the assembly on godbolt. I think I wrote the code using ML model (they save lot of time), and then fixed the issues and extended it myself. I have the code here: https://pastebin.com/GzY4HMsZ

I do the search multiple (thousands) times and calculate total time spent, so the timer is called only 2 times. I search across the same array, but for the different number each time. So the array should be completely within L1 cache. It's also important to use the result of the search otherwise the compiler simply omits the code.

I use clock_gettime() which takes on order of 1us. Timestamp counter (which is faster) is not available for me because every CPU core has its own counter and they show different values, so Linux refuses to use it. First core is 600 ms ahead of other cores. How in 2025 we cannot make a boring counter amazes me. Or, maybe this is intentional to prevent using cheaper consumer hardware for professional purposes.

Here are some numbers:

Array size 8, 200 000 iterations, time ns/seach: linear 8 ns, linear no-branch <1 ns (I wonder if it is an error), binary 11 ns, binary no-branch 7 ns.

Array size 128, 100 000 iterations, time: linear 34 ns, linear no-branch 22 ns, binary 35 ns, binary no-branch 16 ns.

As you can see, the branches is what ruins the performance here. Every time the CPU mis-predicts a branch, it wastes several cycles.

Looking through disassembly on godbolt, the compiler inlined and vectorized the measurement loop for no-branch code, which might explain the numbers. The SIMD assembly is hard to read and I don't quite understand what it does: https://godbolt.org/z/dofKnT3W3

I'll be happy to read if you point me at any mistakes in my benchmark. If you want to compile the code, I used gcc with "-O3" and maybe with "-march=native".

link

YourDadVPN 54 days ago

Interesting, thanks for sharing. I actually assumed ?: desugared to an if-else and never checked. I was a bit confused by the if-else chains you have e.g. on line 111 where it looks like you could just use the else branch for every case?

You could circumvent the counter issue by pinning to a specific core, otherwise you'd have to use scheduler events (ftrace/perf on Linux) to know which CPU you were running on at specific times and when so you could subtract the right counters (I've had to do this before and it isn't pretty). It would also prevent issues where your task gets preempted and moved to another core and you have to warm the caches again. If you use Linux, options like isolcpus and nohz_full ensure you're not measuring the scheduler/other tasks but have to be set at boot. But if your loops between clock_gettime take substantially more than clock_gettime itself then at least the timer overhead is probably not that important.

As for the branch prediction, I would think most of the branches would be predicted correctly because usually you'll have runs of "not x" before each "is x". Switching between many loops of course hurts the prediction.

My SIMD knowledge only extends as far as which compiler options make my code faster.

No shame using LLMs, I use them extensively, but I find I have to write some code yourself because I make noticeably more mistakes coding if I have let the LLM do everything for a couple of weeks.

link

codedokode 54 days ago

> I was a bit confused by the if-else chains you have e.g. on line 111 where it looks like you could just use the else branch for every case?

The point of "ifs" there is to replace a variable "size" with a constant like 8, 16 so that the compiler can hardcode it in the code and maybe make it more efficient (for example, instead of counting the iterations the compiler can just repeat the instruction 8 times when iteration count is 8). I expect that the compiler would generate customized "linear_search" function versions instead of using generic version with unknown number of iterations. You can see the example in godbolt assembly code at line 825, where the compiler unrolled a loop for "linear_search" function with size = 8 - there is no loop, just 8 comparisons.

> As for the branch prediction, I would think most of the branches would be predicted correctly because usually you'll have runs of "not x" before each "is x". Switching between many loops of course hurts the prediction.

I don't think so. In binary search there is no predictable pattern, so the predictor would miss in 50% of cases. And in linear search there is a guaranteed miss in the last iteration.

> You could circumvent the counter issue by pinning to a specific core,

I think I'll try this in future.

> but I find I have to write some code yourself

I can write it myself, but it would take 20-30 minutes including time to find and read the docs for functions while for LLM it takes less than a minute. But anyway I have to edit the result a lot, for example LLM "forgot" that you need to use the result of the search to prevent compiler from throwing the code away. Also, free version of ChatGPT became much dumber couple of months ago.

link

YourDadVPN 54 days ago

Fair point that binary search causes a lot of branch mispredictions. Did you ever measure whether the unrolled linear search with 8 elements outperforms a "rolled" one? Because with instruction level parallelism I wonder how much difference removing 8 mostly easily predicted (in the linear search) branches makes.

link

codedokode 54 days ago

I didn't try, but loop contains a branch so there could be misprediction, which doesn't happen with unrolled loop. Also, instruction level parallelism applies to unrolled code as well.

link