by groundthrower 1657 days ago
It does not consume much memory but does lots of allocations/deallocations. No disk operations whatsoever.
4 comments

M1 has a larger L1 cache, but smaller L3 cache.

It could very well be that your application is hitting a memory pattern that favors larger L1 cache, while the huge L3 cache of EPYC is not useful.

------

If you really wanted to know, you should learn how to use hardware performance counters and check out the instructions-per-clock. If you're around 1 or 2 instructions per clock tick, then you're CPU-bound.

If you're well below that, like 0.1 instructions per clock (i.e., 10 clocks per instruction), then you're cache- and/or RAM-bound.
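A minimal sketch of that rule of thumb, assuming you have already read two raw counter values (retired instructions and core cycles) from whatever profiler you use. The `classify` function and its cutoffs (1.0 and 0.2) are hypothetical choices for illustration; the comment above only gives "around 1 or 2" and "like 0.1" as rough guides, so anything in between is reported as unclear.

```rust
/// Rough bottleneck guess derived from instructions-per-clock (IPC).
#[derive(Debug, PartialEq)]
pub enum Bottleneck {
    CpuBound,
    MemoryBound,
    Unclear,
}

/// Instructions per clock from two raw counter readings.
/// Returns None when no cycles were recorded.
pub fn ipc(instructions: u64, cycles: u64) -> Option<f64> {
    if cycles == 0 {
        None
    } else {
        Some(instructions as f64 / cycles as f64)
    }
}

/// Applies the rule of thumb. The 1.0 and 0.2 cutoffs are assumptions,
/// not values defined by any CPU vendor or profiler.
pub fn classify(instructions: u64, cycles: u64) -> Option<Bottleneck> {
    let value = ipc(instructions, cycles)?;
    Some(if value >= 1.0 {
        Bottleneck::CpuBound
    } else if value <= 0.2 {
        Bottleneck::MemoryBound
    } else {
        Bottleneck::Unclear
    })
}
```

For example, 100,000 instructions over 1,000,000 cycles is an IPC of 0.1, which this sketch reports as `MemoryBound`.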

-----

From there, you continue your exploration. Assuming you were cache/RAM-bound, you count up L1 cache hits, L2 cache hits, L3 cache hits and cache misses. IIRC, there are some performance counters that even get into inter-thread communication (but I forget which ones off the top of my head). If you were CPU-bound, check your execution unit utilization instead.

EPYC unfortunately doesn't have very accurate default performance counters, and I'd bet that no one really knows how to use M1 performance counters yet either.

While the default PMC counters of AMD/EPYC are inaccurate (but easy to understand), AMD has a second set of hard-to-understand, but very accurate profiling counters called IBS Profiling: https://www.codeproject.com/Articles/1264851/IBS-Profiling-w...

Still, having that information ought to give you a better idea of "why" your code performs the way it does. You may have to activate IBS-profiling inside of your BIOS before these IBS-profiling tools work.

By default, AMD only has the default performance counters available. So you may have a bit of a struggle juggling the BIOS + profiler to get things working just right, and then you'll absolutely struggle to understand what the hell you're even looking at once all the data is in.

This.

I have dabbled with the AMD & Intel Xeon side of this, but never on macOS. Do you have an idea how one would go about getting performance counters on macOS? IPC, L1 hit/miss, L2 hit/miss, etc.

Unfortunately not. I only have experience on the AMD-side as I played around on my own personal computer.
Thanks, appreciated!
I’d suggest investigating single core performance. If you have the money, buy an i9-12900K (slightly faster single-core than M1 but much hotter) and do some testing on that. If my theory is correct, performance will be even better.
We have examined that as well. Last week we tried an AMD 5950X, which has half as many cores but much better single-core performance; the result was still only 60% of the Epyc's performance.
What was the M1 % relative to your Epyc?
Roughly 10% faster
Have you investigated memory constraints?

Ryzen is 2 channels; Epyc is 4-8 (depending on CPU). M1 has that stupidly fast/wide setup.

If your Epyc is one of the 4 channel optimized SKUs or is only running in 4 channel mode, you would get pretty close to the quoted ratios on a memory bandwidth test.
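To make the channel arithmetic concrete, here is a rough sketch of theoretical peak bandwidth. It assumes DDR4 with a 64-bit (8-byte) bus per channel and the same DDR4-3200 speed on both machines; the thread does not state the actual memory speeds, so the numbers are illustrative only, and `peak_bandwidth_gbs` is a hypothetical helper.

```rust
/// Theoretical peak memory bandwidth in GB/s.
/// Assumes each channel moves 8 bytes per transfer (64-bit DDR4 channel).
pub fn peak_bandwidth_gbs(channels: u32, mega_transfers_per_sec: u32) -> f64 {
    const BYTES_PER_TRANSFER: f64 = 8.0;
    channels as f64 * mega_transfers_per_sec as f64 * BYTES_PER_TRANSFER / 1000.0
}

/// Ratio of one machine's peak bandwidth to another's at the same memory speed.
pub fn bandwidth_ratio(channels_a: u32, channels_b: u32, mega_transfers_per_sec: u32) -> f64 {
    peak_bandwidth_gbs(channels_a, mega_transfers_per_sec)
        / peak_bandwidth_gbs(channels_b, mega_transfers_per_sec)
}
```

Under those assumptions, 2 channels give about 51.2 GB/s and 4 channels about 102.4 GB/s, a ratio of 0.5. Measured bandwidth will be lower than these peaks on both sides.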

Correlation, not causation, but worth looking into.

Also check the NUMA Nodes per Socket (NPS) setting on EPYC.
HN makes us wait for replies… so if we need to continue this further I’m open at muse.theses-0z@icloud.com .

My next question would be if you ran the 12900K in dual-channel memory.

As others have noted this sounds like a contention issue that you should fix by not allocating in your hot path if at all possible. The easiest fix would probably be to try to switch out your global allocator for something like https://github.com/gnzlbg/jemallocator and see if that doesn't give you a nice performance boost.
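A rough illustration of "not allocating in your hot path", using only the standard library. The workload here (summing squares over a list of items) is made up, since the thread doesn't show the real code; the point is only the pattern of reusing one scratch buffer instead of building a fresh `Vec` per iteration.

```rust
/// Builds a fresh Vec for every item, so the allocator is hit
/// (allocate + free) on each non-empty iteration.
pub fn sum_of_squares_allocating(items: &[Vec<u32>]) -> u64 {
    let mut total = 0u64;
    for item in items {
        let squares: Vec<u64> = item.iter().map(|&x| x as u64 * x as u64).collect();
        total += squares.iter().sum::<u64>();
    }
    total
}

/// Reuses one scratch buffer across iterations. `clear` drops the
/// contents but keeps the capacity, so the buffer only reallocates
/// when an item is larger than anything seen so far.
pub fn sum_of_squares_reusing(items: &[Vec<u32>]) -> u64 {
    let mut total = 0u64;
    let mut scratch: Vec<u64> = Vec::new();
    for item in items {
        scratch.clear();
        scratch.extend(item.iter().map(|&x| x as u64 * x as u64));
        total += scratch.iter().sum::<u64>();
    }
    total
}
```

Both functions return the same result; with many threads doing this at once, the second one should put far less pressure on the shared allocator. In multi-threaded code, each thread would keep its own scratch buffer.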
Hmm, yes we are already using jemallocator actually
It sounds like you might be running into some sort of contention.