Hacker News new | ask | show | jobs
by garfieldandthe 1104 days ago
The central myth is that the average programmer should care.

The typical programmer should treat CPU caches as what they are designed to be: mostly transparent. You work in a high level language and leave the tricky details to a library and your compiler.

It's only a small minority that should really worry about these things.

In my daily work, I see more often premature microoptimizations (in part using the myths from the article) which are entirely unnecessary rather than code that needs to optimize for those things.

2 comments

You should still at least care about 'CPU friendly' data layout in memory and data access patterns in your code to make the CPU's life easier, compiler magic won't help all that much there.

This can often trivially give you a 10x, and sometimes a 100x performance difference for real-world single-threaded code, especially if it needs to work on big data sets. Your fancy high level compiler won't magically reshuffle your data in memory to help with prefetching (at least I'm not aware of a language that does).

Most high level languages popular today are "rooted" in the 90's when the latency gap between CPU and memory practically didn't exist and thus don't care about his specific aspect. Explicit control over memory layout is probably also one of the a main reasons why C stood the test of time so well.

> This can often trivially give you a 10x, and sometimes a 100x performance difference for real-world single-threaded code

That's one hell of a lot of speed up, can you explain how you managed that? I mean I can certainly believe it but not when you put "trivially" in front of it, I could only conceive of that in very special cases and with very careful coding.

Sometimes you have some very very high-use variables, with multiple threads each having one and writing to it often. If multiple of those variables are located on the same cache line, control of it will bounce wildly between cores and this can completely trash your performance. The fix there is absolutely trivial: add padding.

Sometimes you're iterating through a huge number of objects, and only using a couple fields out of each object. If you rearrange the data so each field is stored in a different arena, you can cut the number of memory accesses by a huge amount, and let prefetching work a lot better. That's usually not as trivial but it's simple work.

Another often-trivial one is avoiding linked lists whenever feasible.

> Sometimes you have some very very high-use variables, with multiple threads each having one and writing to it often.

And how often do you actually write such code? Most people: Between rarely and never. You write single threaded code or use synchronization primitives. Sure if you are writing a library squeezing performance out of some parallel processing problem then this is relevant, but that's a niche scenario.

> The fix there is absolutely trivial: add padding.

Most common case is that this just wastes memory. Don't micro-optimizr before you know that this is actually a problem.

> Another often-trivial one is avoiding linked lists whenever feasible.

Again, not true, depending on your use case. If the operations that you commonly perform on the data structure, like inserting/deleting elements, then of course you should use a linked list or whatever container data structure your language provides. How caches play into this is at most a second order effect in the common case, unless you really want to optimize a tight loop in a performance critical application.

I've seen too many prematurely applied fancy data structures where it turned out that all this extra complexity was entirely unnecessary and just made things harder to maintain.

> Most common case is that this just wastes memory. Don't micro-optimizr before you know that this is actually a problem.

You don't have many variables that fit the description I gave. You're already doing profiling if you can pick them out, and preemptively spacing them would barely take any memory, and it's the kind of problem that's hard to thoroughly test unless you have a 64 core machine sitting around.

> If the operations that you commonly perform on the data structure, like inserting/deleting elements, then of course you should use a linked list

For situations where linked list and array are both usable, then even if you very commonly insert and delete you're usually better off with a data structure that's built on top of fixed-size arrays. Iterating an array is so fast that it makes up for the cost of shifting around a surprisingly large number of elements.

> or whatever container data structure your language provides.

But which one? Languages tend to have a lot.

And the ones that give you a one-size-fits-all data structure usually don't have a built-in linked list anyway.

And I'm not suggesting anything notably fancy.

Avoid cache line ping-pong, don't waste cache, and make accesses as spatially predictable as possible. Okay, I can see where you're coming from, thanks.
I haven't seen such dramatic speedups in my own code for quite a while since I don't tend to start with a worst case version.

With some googling it's easy to find cases like this where switching from a traditional OOP approach of "unique objects in random heap locations" to a DOD approach gains a 173x speedup overall (it's not just the tight memory layout of course, but also all the code simplifications this enables, like tighter loops and then easier integration of multithreading):

https://medium.com/@jasonbooth_86226/intro-to-jobs-burst-dod...

Basically, search for DOD (Data Oriented Design), ECS (Entity Component System), SoA (Structure of Arrays) for similar optimization stories.

Ages ago had a mathematician shard image processing across cores.

Started with a pair of loops for each image, foreach col, foreach row.

Problem: C++ 2d arrays are row-col not col-row.

Halfway through multi-threaded performance was much worse than single threaded.

Eventually we switched to row-col processing, single-threaded was fast enough, back to two loops per image.

This is partly wrong. For example, it's important how the fields are organized in a structure, and only the developer knows what belongs together and only they know the access patterns across threads (or at least should know). This is far from being a micro-optimization.
If I do it wrong, are we talking about my application (say, some web app) becoming noticeably slower? Or we would need to be running several million operations per second before anything a user could notice the difference?

I suspect only very high end games, image processing apps and that kind of thing would ever need to care about the order of fields in a struct... perhaps the author of my web server as well, but even that only if I have a ridiculously high load on my server (which I don't, almost no one does).

So, no I don't think I should know that until I develop one of those highly niche applications myself.

> are we talking about my application (say, some web app) becoming noticeably slower? Or we would need to be running several million operations per second before anything a user could notice the difference?

If you have an algorithm with quadratic or higher complexity somewhere you will be in the millions even if you only have a few thousand entries to deal with. Now the worst case cache miss (down to ram) is modeled as roughly 1000 cycles. Put that right in the middle of that O(n^2) algorithm and your web app will be able to handle 1 request per second on a good day.

Note: that calculation is extremely simplified

Maybe you shouldn't be doing O(n^2) inside a request handler in the first place. This has nothing to do with caches.

And even if you do quadratic operations in your handler, how often do you write a new such handler? Most of the time you work on stuff around that, supporting infra, testing, etc. None of those needs to be cache optimized either.

> Maybe you shouldn't be doing O(n^2)

It was mostly meant as a example of millions of operations even at a small scale. Sadly I have seen way too much accidental O(n^3) code and not all of it could be "fixed" trivially.

> This has nothing to do with caches.

Would you rather handle 1000 request per second or 1? Caching has the potential to make an already bad situation so much worse.

> how often do you write a new such handler?

You only need to write one to DoS your server.

> Would you rather handle 1000 request per second or 1?

Would you rather like to approach this problem by optimizing memory layout which, let's be honest, tends to give you more in the region of 1-10% improvements rather than 10-100x, or would you rather like to try to go from O(n^3) to O(n^2) or O(n log n) or similar?

The real meat is in the complexity class. Cache effects are where you go when there is nothing else to optimize and it actually matters. Preemptively making all your data structures 2-3 as big because of not-actually-necessary padding does not sound right to me, but YMMV.

How many of your structs does your code access million times a second? For most programs, that number is firmly 0.

While you are optimizing your struct layout and decrease you app start time from 1.122765 seconds to 1.122764 seconds, I ship production code at 2x the rate because I only optimize for performance where it matters while otherwise optimizing for maintainability, testability, extensibility, dev fun and actually shipping.