Yes, thank you. Improving cache efficiency is by far the biggest single thing you can do to increase performance. If the code and data for both outcomes of an 'if' are in L1 cache, that 'if' is never going to be slow.
That's still dwarfed by a cache miss if you're unlucky.
I just found a bug (slow code is a bug) in the D backend where bad data layout led to 32 MILLION LLC misses (85% of the whole program) coming from one line!
Think about how much of a cache line you are actually using per iteration, folks.
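To make "how much of a cache line per iteration" concrete, here is a rough sketch in Go. Everything in it is made up for illustration: `Record`, `sumAoS`, `sumSoA` and `usefulFraction` are hypothetical names, not code from the D backend mentioned above. It also assumes a 64-bit target, where `Record` comes out at 64 bytes, and that the element stride is no bigger than a cache line.

```go
package main

import (
	"fmt"
	"unsafe"
)

// Record is a hypothetical "fat" element. The hot loop below only needs
// Weight, but walking a []Record still pulls ID and Name through the cache.
type Record struct {
	ID     uint64
	Name   [48]byte
	Weight float64
}

// sumAoS walks an array of structs: one 8-byte field is used per element,
// while the whole element has to be fetched from memory.
func sumAoS(rs []Record) float64 {
	var total float64
	for i := range rs {
		total += rs[i].Weight
	}
	return total
}

// sumSoA walks a dense slice holding only the hot field, so every byte
// that gets fetched is also used.
func sumSoA(weights []float64) float64 {
	var total float64
	for _, w := range weights {
		total += w
	}
	return total
}

// usefulFraction estimates how much of the fetched memory a sequential loop
// uses, assuming the stride is no larger than a cache line.
func usefulFraction(usedBytes, strideBytes uintptr) float64 {
	return float64(usedBytes) / float64(strideBytes)
}

func main() {
	var r Record
	used := unsafe.Sizeof(r.Weight)
	fmt.Println("AoS stride in bytes:", unsafe.Sizeof(r))
	fmt.Println("AoS useful fraction:", usefulFraction(used, unsafe.Sizeof(r)))
	fmt.Println("SoA useful fraction:", usefulFraction(used, unsafe.Sizeof(float64(0))))
}
```

Both loops compute the same sum. The difference is only in how many bytes have to cross the memory bus to get there, which is the thing worth measuring before restructuring anything.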
The kind of data that is going to generate (enough) cache misses (to be a problem) behind your back is usually the stuff which you can't put on the stack.
Yup, and Go is particularly bad for this because it handles allocations automatically (and poorly). I can double a Go program's performance by going through the memory profile and rearranging the instructions to minimize hidden allocations.
The worst offender is slices, since you can't mark them read-only or stack-allocated.
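For anyone who hasn't run into this, a minimal sketch of the slice problem and one common workaround. It doesn't make a slice read-only or stack-allocated; it only hands the loop a buffer it can reuse. `squaresGrow`, `squaresInto`, `allocsGrow`, `allocsReuse` and `sink` are hypothetical names for illustration. `testing.AllocsPerRun` is the standard library helper that counts allocations per call.

```go
package main

import (
	"fmt"
	"testing"
)

// sink is package-level so the results escape to the heap, as they
// usually do in real programs.
var sink []int

// squaresGrow starts from a nil slice, so append has to allocate a new
// backing array (and copy the old one) each time capacity runs out.
func squaresGrow(n int) []int {
	var out []int
	for i := 0; i < n; i++ {
		out = append(out, i*i)
	}
	return out
}

// squaresInto reuses the caller's buffer. As long as cap(dst) >= n,
// append never needs to reallocate.
func squaresInto(dst []int, n int) []int {
	dst = dst[:0]
	for i := 0; i < n; i++ {
		dst = append(dst, i*i)
	}
	return dst
}

// allocsGrow reports the average number of allocations per call
// for the growing version.
func allocsGrow(n int) float64 {
	return testing.AllocsPerRun(100, func() { sink = squaresGrow(n) })
}

// allocsReuse does the same for the buffer-reusing version. The buffer
// is allocated once, outside the measured function.
func allocsReuse(n int) float64 {
	buf := make([]int, 0, n)
	return testing.AllocsPerRun(100, func() { sink = squaresInto(buf, n) })
}

func main() {
	fmt.Println("allocs per call, growing from nil:", allocsGrow(1000))
	fmt.Println("allocs per call, reusing a buffer:", allocsReuse(1000))
}
```

The exact counts depend on the Go version and its growth policy, so treat the numbers as relative. The point is that the first version allocates on every call without a single `make` or `new` appearing in the source.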