Hacker News new | ask | show | jobs
by jblow 2401 days ago
Disagree on that last part. If more programmers understood cache coherency, maybe their programs would not run like a giant turd.
7 comments

Are the details really helpful for performance work? MOSI, MOESI, MERSI, MESIF etc are 99% irrelevant to having the right metal model. "Dirtying cache lines across different cores/threads is slow" is most of what you need to know. Within the same core the coherency protocols between levels of cache is not really visible at all to software even as varying performance artifacts.

Of course you might end up analyzing assembly level perf traces in a hot path for some game console with a less known CPU architecture and making a cross cpu cache miss slightly less slow could just maybe be helped by the detailed understanding of the machine model, but by that time you're already far in the not-giant-turd territory (at least if you're optimizing the right thing).

Of course computer architecture is fascinating and fun to learn about.

> Of course you might end up analyzing assembly level perf traces in a hot path for some game console with a less known CPU architecture

Or if you work in finance, mining, oil industry, medical, biosciences and countless other fields where you need to get good performance out of your hardware. Yes, even if you use GPUs, they're no magic bullet, they also have their architectural bottlenecks, strengths and weaknesses.

Or if you care about power consumption.

There are a lot of reasons to optimize hot paths and inner loops. CPU single core performance isn't improving much anymore, and we need to make better use of what we have.

I was trying to say that even in those cases you almost never benefit from knowing the details of cache coherency.
That's a very bold statement.

You could just as well for example say that front end Javascript developer almost never needs to understand event callbacks or how DOM works.

If you write multithreaded high performance code, yeah, you do need to know about cache coherency at varying levels of detail. Sometimes rough rules of thumb work, not too often you need to understand all those annoying performance destroying details that leak through cache abstraction.

DOM and event callbacks are basic bread and butter for JS developers and in web apps it's essential to understand how DOM works and how to minimize DOM changes when doing performance work. Can you come up with an example where a oil or finance sector developer needs to understand MESIF?

Or are we having a terminology confusion and you use the coherency term for software visible performance characteristics of caches generally? I do agree that understanding cache effects and their intersection with multiprocessing is generally important in perf work. As is understanding the architected memory model, which tells you what you can and can't rely on semantically.

> Can you come up with an example where a oil or finance sector developer needs to understand MESIF?

Yes. When they're writing high performance multithreaded analysis software and they're deciding which cache lines to write to and which only to read from. Those lines you only read from can be in S (shared) or F (forward) state.

And why is this important? Performance characteristics of a line entering in M (modified) state are pretty bad, if the line is shared between multiple cores.

Perhaps you'll also want to do all reads at once, knowing the line will more likely remain in F state for short periods, instead of bouncing S/F line state between CPU cores?

NUMA comes also to play, making this mistake even more costly. You really want to keep inter-core (especially NUMA socket!) communication to the minimum.

Of course, you could say you don't strictly need to understand MESI (or MESIF), but it really helps understanding why you do things certain way and reasoning about multi-core performance. The thing is, you can say same "you don't need to know this" about a lot of "low level details" in software trade.

Just like you need to understand DOM as a front-end developer to minimize DOM changes, even if you don't access it directly.

In cache coherency case it's analogously about reducing unnecessary multicast and broadcast messages sent between cores.

>> Dirtying cache lines across different cores/threads is slow

Very succinct and correct. This is particularly pernicious with atomics, which people use for lock-free stuff such as queues and work stealing thread pools. If atomics used by different threads/cores share a cache line and you mutate them a lot, perf gets instantly fucked, sometimes worse than if you used a mutex in the first place. And if you don't benchmark, you aren't going to notice.

Surely the whole point of good system design is a set of logical abstractions, where you need to understand the logical model and not the internal details - as these are free to be evolved.

Of course performance matters, but surely having performance tests, rather than trying to second guess what the whole stack below you might be doing, is

1. more efficient 2. more accurate 3. more likely to detect changes in a timely way.

That's not to say, you shouldn't be curious and deep understanding isn't a good thing.

Just saying understanding inside-out the abstraction you are working with ( eg Java Memory Model ) it's performance characteristics ( from real world testing ) - is more important than some passing knowledge of real world CPU design.

This app I am using right now is in a webbrowser - not sure how understanding cache coherency helps in a single threaded javascript.

> Surely the whole point of good system design is a set of logical abstractions, where you need to understand the logical model and not the internal details

In my opinion an important part of being a good programmer is understanding - at least at a broad level - how the set of abstractions you're working on top of work.

At the end of the day, our job is making computer hardware operate on data in memory. The more that we forget that, and think about computing as some abstract endeavor performed in Plato's heaven, the more tendency we have for bloat and inefficiency to creep into our various abstractions.

In other words, I think it's better to think about abstractions as a tool for interacting with hardware, not as something to save us from dealing with hardware.

Eh, idk. Most programs that run like a giant turd do it because they load 15 megabytes of Javascript libraries to call two functions, or something to that effect. Computers are so fast now that you really need to be doing something unbelievably stupid for things in consumer programs to not be instantaneous.
Like start Firefox? I have no idea why it takes multiple seconds to bring up a blank window when starting a comparably useful program like Claws Mail is absolutely instantaneous.
To be honest, yes, generally programs that don't start instantaneously do something very unnecessary and stupid that has nothing to do with actual CPU performance. As in they read thousands of tiny files or they decompress files or they wait for some sort of answer from the network, or they make 10k+ expensive API calls or all of the above. Firefox is so big it might fall into the "all of the above" category, but to answer that question definitively you'd have to analyze what Firefox actually does.

And of course, making the decompression 15-20% faster by optimizing the decompression code (which is usually not even written by the developers of said software but just some external library) won't even make a difference because 20% less than 5 seconds is still 4 seconds which is way too long for a program to start. Instead using a different compression algorithm that increases file size by 25% but decompression speed by 10x would actually start solving the problem, with the next step being to ask why the program needs to read so much damn data at the start in the first place.

But since NVMe SSDs and Intel CPUs with very high boost clocks are quickly becoming the norm now even for laptops I don't see much of that happening, because Firefox starts pretty quickly (~1 second) on those machines.

Fwiw, Firefox stores almost everything it will need on startup in one large file (omni.ja) precisely to avoid the "thousands of tiny files" problem. That data is uncompressed, precisely to avoid the decompression problem. The low-hanging fruit has largely been picked.

As for why so much data needs to be read... I just checked, and on Mac the main Firefox library (the executable itself is mostly a stub) is 120MB. So that's going to take a second or three just to read in at typical HDD speeds (faster on a good SSD), and then the dynamic linker has to do its thing on that big library, which is not instantaneous either.

How big is the Claws Mail binary? A large part of the cost of starting up Firefox, or other large programs, is just getting the binary from disk into memory plus all the work the dynamic linker needs to do once it's there.
Most engineers don't write code with hard performance constraints. Game devs probably need to be fighting to get every frame.

For the bulk of the eng I work with the concept of StoreLoad reordering on x86 would be an academic distraction.

> Most engineers don't write code with hard performance constraints.

Only because most software "engineers" don't give two shits about the actual user experience of their glacially slow over-engineered garbage.

99.999% of performance issues in development are related to the abstract model, and not the underlying implementation details. Things such as searching in a big unordered list, repeating the same work, stupid SQL queries ect.
The way I usually follow it is 1) Is this OK? For example It is only called once in the code in an error path? 2) Did I do something silly? For example, I left in some extra debug code. 3) is the code doing something silly? For example extra work, dragging in extra data that is not needed? 4) is the code written in a way that causes extra work, re-init on inner loop, loop hoisting, etc 5) Is there a better algorithm for this? Binary search vs linear? 6) is the code doing too many things at once. De-serialize it re-serialize it 7) Is there a better abstraction for this? Monolith code vs microservice? 8) Is there any compiler switches that are 'easy' wins? Packing, -O2, etc? Usually not. 9) What sort of memory arch does this machine have? It varies between machines even for the x86 world. For example if I rearrange my memory structures do I use less cache and emit less code. The cache line on this box is x bytes and on that box is y bytes. Some languages do not lend themselves well to this one due to the way their VM/ISA is written. 10) is there some asm that would help? usually not.

Usually 1-7 are all you need. If you get all the way to 10 you are in the deep end for most things.

Big O is good for many things. But in reality big O is O(N)+C where the C can get you. That is where the later steps help. But usually you can totally ignore it. Most of the big wins I get are just from flipping out a bad search for a O(log(n)) search, or removing 'extra' code.

On the BigO notation, you have to remember is that it's an upper bound on the runtime. There is no guarantee other than the growth of the function.
Oh I agree. It is just that +C bit. What happens when your O(nLog(n)) basically just trashes the cache because of your data structures? Yet maybe something 'worse' would run better because of the way cache works? That +c bit can sometimes turn into its own big O. It is a nasty little gotcha when it does happen.

It is a good framework to get you in the ballpark of the correct thing. Even usually 99% of the time it is right. But sometimes the arch bites back due to your data.

The worst piece of code I had to optimize was running a database search with an O(n^3) algorithm.

Fixing that did not require knowledge of cache lines.

No, because most software development houses prioritize getting something that works over performance wankery.
Performance is a feature.
But not a feature a substantial number of paying customers care about.
We'll never know if that's true, because they were never given the option.
The problem is you never know when you’ll go from “dev who doesn’t care” to “dev who does”.

While I agree that the details of StoreLoad are likely a distraction the big picture concepts of cache coherence presented in this article are table stakes for performant systems.

The problem is you never know when you’ll go from “dev who doesn’t care” to “dev who does”.

This is true for literally every hard problem in dev though, and the implication that you need to grasp everything just in case you need it is silly. The problem space in compsci is too big to know everything. We have to choose.

Part of the reason this kind of blogs exist is to give you more information to choose properly.
While that might be true in practice, I do think not knowing when you transition between those two indicates a failing by your coworkers/mentors.

I encourage people I work with to read many of the books in this book series. I particularly encourage them to read “Hardware and Software Support for Virtualization”, since it’s basically a book on their job.

Just to be clear, the book series you are referring to is "Synthesis Lectures on Computer Architecture" published by Morgan-Claypool, right?

If you had to rank them in importance for the average engineer, how would you rank them?

Money is another way to measure performance. People that understand these low level principles can write a backend service in C/C++/Java/Erlang/Whatever that can do the work of hundreds of instances of a service written in higher level languages that ignore these concepts. That's not to say that it's never wise to use higher level languages, but these principles can and do save mature established businesses real money.
True, but if my bad python implementation can do the job on one old single core computer with plenty of CPU left over what do I care - in fact because python is generally more productive for small programs it is a lot cheaper. Most programs don't actually handle so much data that they need every bit of performance.

Of course the minority that do need performance are where the real engineers are needed.

I'm writing a little C++/Qt application to visualise a fairly small amount of data.

The lag is already noticeable. If I don't pay attention to performance, I know the end result will be slow and unpleasant to use.

That is true, but you are likely better off using a library than focusing yourself on cache optimization.
Using a library (specifically, taking advantage of Qt's QVariant and view/model framework), is what triggered most of the slowdown.
I was just in a room with a guy who claims he "takes pride in his work" so everything has to be as fast as technically possible., use the least RAM down to KBs, never write to a disk if possible, etc. The problem is the schools. Academics have no concept of who is paying for everything around them its like asking a welfare recipient to teach you about budgets.
I disagree on this. More programmers should understand the performance characteristics of the abstraction layers that they rely on. Else we have the case of jerk programmers who insist on redesigning / rewriting / refactoring to optimise for CPU caches while still running the apps via a bunch of docker containers each based of the centos image to run one simple binary that probably needs only glibc.
Maybe but it feels like most are stuck in environments which will do bad things to their cache coherence. It's fine if you are doing some data processing in C. If you are using .NET with a bunch of magic libraries or Javascript or whatever, sure it will help, but to actually make an impact you have to be very careful.
I agree with you Jonathan but am wondering, will Jai help programmers write programs with better cache coherency even if said programmers don’t understand cache coherency well? Or is that orthogonal to the goals of Jai?
Like any reasonable language, I imagine JAI will allow a developer to write software with the cache in mind. It will not force them to, nor will it prevent them from doing otherwise.
If it still plans on wrapping SOA in something that looks in use like AOS, it could make people less aware of how their code is impacting cache (would now have to look at the definitions instead of seeing the array form at point of usage). But if it is enough more ergonomic to write AOS code it might still be worth it and increase uptake.
The developer would still need to understand enough to be able to choose between them though, even if using them is identical.
Only if he releases it