| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by mhh__ 1879 days ago
	Usual disclaimer: The branch predictor is one of the things that will make your program slower, so keep it in mind, but it's the memory stupid(!) so if the powers that be want a faster program in a few days focus on memory layout and cache usage first.

2 comments

naikrovek 1879 days ago

yes, thank you. improving cache efficiency is by far the biggest single thing you can do to increase performance. if the code and data for both outcomes of an 'if' are in L1 cache, that 'if' is never going to be slow.

link

richardwhiuk 1879 days ago

depends how deep the pipeline is - with a long pipeline, a pipeline flush as a result of an incorrect predict can stall for tens of cycles.

link

mhh__ 1878 days ago

That's still dwarfed by a cache miss if you're unlucky.

I just found a bug (slow code is a bug) in the D backend where bad data layout led to 32 MILLION LLC misses (85% of the whole program) coming from one line!

Think about how much of a cacheline you are using per iteration folks.

link

naikrovek 1866 days ago

missing the cache will cost hundreds of cycles.

link

jupp0r 1878 days ago

Exactly! This is best achieved by keeping data on the stack and avoiding allocations as much as possible.

link

mhh__ 1878 days ago

The kind of data that is going to generate (enough) cache misses (to be a problem) behind your back is usually the stuff which you can't put on the stack.

link

jupp0r 1878 days ago

It can also be lots of small careless allocations all over the place.

link

kstenerud 1878 days ago

Yup, and go is particularly bad for this because it handles allocations automatically (and poorly). I can double a go program's performance by going through the memory profile and rearranging the instructions to minimize hidden applications.

The worst offender is slices, since you can't mark them read only or stack allocated.

link

antpls 1878 days ago

People need to focus on solving business problem first. "memory layout" and "cache usage" should be abstracted with premade data structures and algorithms into a library. You could then use "CacheFriendlyHashMap" class as a drop in. Does such library already exist for any language ?

link

logicchains 1878 days ago

>You could then use "CacheFriendlyHashMap" class as a drop in

It's impossible to have a generic cache-friendly datastructure because the optimal datastructure depends on access patterns. E.g. if I have a collection of struct{x:int, y:int, z:int, w:int}, and I know the two main usecases are indexing by x, and iterating over the collection performing some opp on (y,z,w), then the ideal datastructure is one where all the x are contiguous in memory ([x1, x2, x3...]) and separately the y,z,w triples are contiguous ([y1,z1,w1,y2,z2,w2,y3,z3,w3...]).

link

miscmask 1878 days ago

> and separately the y,z,w triples are contiguous ([y1,z1,w1,y2,z2,w2,y3,z3,w3...]).

Wouldn't having ys, zs and ws occupying different cache lines be good enough? After all, the CPU only wants fast access to these data. Or maybe it's a hard thing to do for CPUs to fetch from 4 different lines at a time (+1 for the instructions)?

link

adrianN 1878 days ago

I have a hard time imagining a programming language where memory layout of your data can be abstracted away like that. You need to think about which data to group together in memory and that depends on the access patterns. It's not enough to replace your binary trees with cache-friendlier B-trees if you still need to access half a dozen of them to perform one iteration of a loop.

link

eternalban 1878 days ago

JIT for data, or an off-line simulation run pass for compiler could help. We'd probably have this already if it wasn't for the fact that unlike code, a JIT regime for memory can't be universally effective. So for the PLT part of it, maybe domain constrained languages? But that said, that all would bring into userland all the gymnastics that VM is doing with paging, etc. I think the more interesting question is what would an OS + language families that convey semantics/intent to the OS look like?

link