Hacker News new | ask | show | jobs
by YourDadVPN 44 days ago
Interesting, thanks for sharing. I actually assumed ?: desugared to an if-else and never checked. I was a bit confused by the if-else chains you have e.g. on line 111 where it looks like you could just use the else branch for every case?

You could circumvent the counter issue by pinning to a specific core, otherwise you'd have to use scheduler events (ftrace/perf on Linux) to know which CPU you were running on at specific times and when so you could subtract the right counters (I've had to do this before and it isn't pretty). It would also prevent issues where your task gets preempted and moved to another core and you have to warm the caches again. If you use Linux, options like isolcpus and nohz_full ensure you're not measuring the scheduler/other tasks but have to be set at boot. But if your loops between clock_gettime take substantially more than clock_gettime itself then at least the timer overhead is probably not that important.

As for the branch prediction, I would think most of the branches would be predicted correctly because usually you'll have runs of "not x" before each "is x". Switching between many loops of course hurts the prediction.

My SIMD knowledge only extends as far as which compiler options make my code faster.

No shame using LLMs, I use them extensively, but I find I have to write some code yourself because I make noticeably more mistakes coding if I have let the LLM do everything for a couple of weeks.

1 comments

> I was a bit confused by the if-else chains you have e.g. on line 111 where it looks like you could just use the else branch for every case?

The point of "ifs" there is to replace a variable "size" with a constant like 8, 16 so that the compiler can hardcode it in the code and maybe make it more efficient (for example, instead of counting the iterations the compiler can just repeat the instruction 8 times when iteration count is 8). I expect that the compiler would generate customized "linear_search" function versions instead of using generic version with unknown number of iterations. You can see the example in godbolt assembly code at line 825, where the compiler unrolled a loop for "linear_search" function with size = 8 - there is no loop, just 8 comparisons.

> As for the branch prediction, I would think most of the branches would be predicted correctly because usually you'll have runs of "not x" before each "is x". Switching between many loops of course hurts the prediction.

I don't think so. In binary search there is no predictable pattern, so the predictor would miss in 50% of cases. And in linear search there is a guaranteed miss in the last iteration.

> You could circumvent the counter issue by pinning to a specific core,

I think I'll try this in future.

> but I find I have to write some code yourself

I can write it myself, but it would take 20-30 minutes including time to find and read the docs for functions while for LLM it takes less than a minute. But anyway I have to edit the result a lot, for example LLM "forgot" that you need to use the result of the search to prevent compiler from throwing the code away. Also, free version of ChatGPT became much dumber couple of months ago.

Fair point that binary search causes a lot of branch mispredictions. Did you ever measure whether the unrolled linear search with 8 elements outperforms a "rolled" one? Because with instruction level parallelism I wonder how much difference removing 8 mostly easily predicted (in the linear search) branches makes.
I didn't try, but loop contains a branch so there could be misprediction, which doesn't happen with unrolled loop. Also, instruction level parallelism applies to unrolled code as well.