Hacker News new | ask | show | jobs
by _a_a_a_ 1104 days ago
> This can often trivially give you a 10x, and sometimes a 100x performance difference for real-world single-threaded code

That's one hell of a lot of speed up, can you explain how you managed that? I mean I can certainly believe it but not when you put "trivially" in front of it, I could only conceive of that in very special cases and with very careful coding.

3 comments

Sometimes you have some very very high-use variables, with multiple threads each having one and writing to it often. If multiple of those variables are located on the same cache line, control of it will bounce wildly between cores and this can completely trash your performance. The fix there is absolutely trivial: add padding.

Sometimes you're iterating through a huge number of objects, and only using a couple fields out of each object. If you rearrange the data so each field is stored in a different arena, you can cut the number of memory accesses by a huge amount, and let prefetching work a lot better. That's usually not as trivial but it's simple work.

Another often-trivial one is avoiding linked lists whenever feasible.

> Sometimes you have some very very high-use variables, with multiple threads each having one and writing to it often.

And how often do you actually write such code? Most people: Between rarely and never. You write single threaded code or use synchronization primitives. Sure if you are writing a library squeezing performance out of some parallel processing problem then this is relevant, but that's a niche scenario.

> The fix there is absolutely trivial: add padding.

Most common case is that this just wastes memory. Don't micro-optimizr before you know that this is actually a problem.

> Another often-trivial one is avoiding linked lists whenever feasible.

Again, not true, depending on your use case. If the operations that you commonly perform on the data structure, like inserting/deleting elements, then of course you should use a linked list or whatever container data structure your language provides. How caches play into this is at most a second order effect in the common case, unless you really want to optimize a tight loop in a performance critical application.

I've seen too many prematurely applied fancy data structures where it turned out that all this extra complexity was entirely unnecessary and just made things harder to maintain.

> Most common case is that this just wastes memory. Don't micro-optimizr before you know that this is actually a problem.

You don't have many variables that fit the description I gave. You're already doing profiling if you can pick them out, and preemptively spacing them would barely take any memory, and it's the kind of problem that's hard to thoroughly test unless you have a 64 core machine sitting around.

> If the operations that you commonly perform on the data structure, like inserting/deleting elements, then of course you should use a linked list

For situations where linked list and array are both usable, then even if you very commonly insert and delete you're usually better off with a data structure that's built on top of fixed-size arrays. Iterating an array is so fast that it makes up for the cost of shifting around a surprisingly large number of elements.

> or whatever container data structure your language provides.

But which one? Languages tend to have a lot.

And the ones that give you a one-size-fits-all data structure usually don't have a built-in linked list anyway.

And I'm not suggesting anything notably fancy.

Avoid cache line ping-pong, don't waste cache, and make accesses as spatially predictable as possible. Okay, I can see where you're coming from, thanks.
I haven't seen such dramatic speedups in my own code for quite a while since I don't tend to start with a worst case version.

With some googling it's easy to find cases like this where switching from a traditional OOP approach of "unique objects in random heap locations" to a DOD approach gains a 173x speedup overall (it's not just the tight memory layout of course, but also all the code simplifications this enables, like tighter loops and then easier integration of multithreading):

https://medium.com/@jasonbooth_86226/intro-to-jobs-burst-dod...

Basically, search for DOD (Data Oriented Design), ECS (Entity Component System), SoA (Structure of Arrays) for similar optimization stories.

Ages ago had a mathematician shard image processing across cores.

Started with a pair of loops for each image, foreach col, foreach row.

Problem: C++ 2d arrays are row-col not col-row.

Halfway through multi-threaded performance was much worse than single threaded.

Eventually we switched to row-col processing, single-threaded was fast enough, back to two loops per image.