Hacker News new | ask | show | jobs
by mhh__ 1879 days ago
Usual disclaimer: The branch predictor is one of the things that will make your program slower, so keep it in mind, but it's the memory stupid(!) so if the powers that be want a faster program in a few days focus on memory layout and cache usage first.
2 comments

yes, thank you. improving cache efficiency is by far the biggest single thing you can do to increase performance. if the code and data for both outcomes of an 'if' are in L1 cache, that 'if' is never going to be slow.
depends how deep the pipeline is - with a long pipeline, a pipeline flush as a result of an incorrect predict can stall for tens of cycles.
That's still dwarfed by a cache miss if you're unlucky.

I just found a bug (slow code is a bug) in the D backend where bad data layout led to 32 MILLION LLC misses (85% of the whole program) coming from one line!

Think about how much of a cacheline you are using per iteration folks.

missing the cache will cost hundreds of cycles.
Exactly! This is best achieved by keeping data on the stack and avoiding allocations as much as possible.
The kind of data that is going to generate (enough) cache misses (to be a problem) behind your back is usually the stuff which you can't put on the stack.
It can also be lots of small careless allocations all over the place.
Yup, and go is particularly bad for this because it handles allocations automatically (and poorly). I can double a go program's performance by going through the memory profile and rearranging the instructions to minimize hidden applications.

The worst offender is slices, since you can't mark them read only or stack allocated.

People need to focus on solving business problem first. "memory layout" and "cache usage" should be abstracted with premade data structures and algorithms into a library. You could then use "CacheFriendlyHashMap" class as a drop in. Does such library already exist for any language ?
>You could then use "CacheFriendlyHashMap" class as a drop in

It's impossible to have a generic cache-friendly datastructure because the optimal datastructure depends on access patterns. E.g. if I have a collection of struct{x:int, y:int, z:int, w:int}, and I know the two main usecases are indexing by x, and iterating over the collection performing some opp on (y,z,w), then the ideal datastructure is one where all the x are contiguous in memory ([x1, x2, x3...]) and separately the y,z,w triples are contiguous ([y1,z1,w1,y2,z2,w2,y3,z3,w3...]).

> and separately the y,z,w triples are contiguous ([y1,z1,w1,y2,z2,w2,y3,z3,w3...]).

Wouldn't having ys, zs and ws occupying different cache lines be good enough? After all, the CPU only wants fast access to these data. Or maybe it's a hard thing to do for CPUs to fetch from 4 different lines at a time (+1 for the instructions)?

I have a hard time imagining a programming language where memory layout of your data can be abstracted away like that. You need to think about which data to group together in memory and that depends on the access patterns. It's not enough to replace your binary trees with cache-friendlier B-trees if you still need to access half a dozen of them to perform one iteration of a loop.
JIT for data, or an off-line simulation run pass for compiler could help. We'd probably have this already if it wasn't for the fact that unlike code, a JIT regime for memory can't be universally effective. So for the PLT part of it, maybe domain constrained languages? But that said, that all would bring into userland all the gymnastics that VM is doing with paging, etc. I think the more interesting question is what would an OS + language families that convey semantics/intent to the OS look like?