by al2o3cr 1653 days ago

    don’t ask me how
_You_ should be asking yourself how. There are lots of reasons why this could be happening, and knowing which one applies is important if you're changing stuff.

Based on a "highly parallelizable" application performing better on 8 cores than 32, I'd guess you're running out of something else: memory or disk bandwidth.


Probably the hardest thing to clean up is a codebase where very complicated "optimizations" were built because someone didn't understand some very basic bottlenecks.

I recently inherited an app that makes heavy use of Redis caching because someone didn't first try optimizing the SQL. The complexity that Redis caching adds is insane to maintain compared to spending a few minutes optimizing SQL.

The original poster really needs to hook up a profiler.

Also, having written lots of parallel code: parallelization isn't a magic way to make things faster. If the codebase breaks work up into lots of tiny tasks that run in parallel, the parallelization overhead can outweigh the gain. Sometimes the fastest way (in both performance and implementation time) is to keep most of the codebase serial, parallelize only at the highest level, and never share data among operations.
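To make the "parallelize only at the highest level" idea concrete, here is a minimal sketch in Rust using only the standard library. The function name `parallel_sum_of_squares` and the workload are made up for illustration; the point is just that each thread gets one big chunk, returns one partial result, and nothing is shared or locked while the work runs.

```rust
use std::thread;

/// Sums the squares of `data` by splitting it into at most `workers`
/// contiguous chunks. Hypothetical workload, for illustration only.
fn parallel_sum_of_squares(data: &[u64], workers: usize) -> u64 {
    let workers = workers.max(1);
    // Ceiling division so every element lands in some chunk; clamp to 1
    // because `chunks` panics on a chunk length of 0.
    let chunk_len = ((data.len() + workers - 1) / workers).max(1);

    thread::scope(|scope| {
        // One coarse task per chunk: no locks, no shared mutable state.
        let handles: Vec<_> = data
            .chunks(chunk_len)
            .map(|chunk| scope.spawn(move || chunk.iter().map(|x| x * x).sum::<u64>()))
            .collect();

        // Combine the per-thread partial sums serially at the very end.
        handles
            .into_iter()
            .map(|handle| handle.join().unwrap())
            .sum::<u64>()
    })
}
```

Calling `parallel_sum_of_squares(&[1, 2, 3, 4, 5], 2)` returns 55. Spawning a thread per element instead of per chunk would compute the same answer, but the spawn and join overhead would dwarf the actual work.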

The old "anything but reviewing the execution plan" approach: throw more vCPUs at it! Thank god for Query Store.
If his application is running better on 8 cores instead of 32, that reeks to me of a dependency on single-core performance somewhere. An example of this would be Minecraft, which performs worse on heavily multi-core systems than on a few fast cores (like the M1's).
Also Dwarf Fortress, which runs tons of simulations but is a single-threaded 32-bit application, which makes multithreaded performance and RAM beyond ~2GB meaningless.
+1. They should start profiling their application. If it's running on Alpine Linux, for example, the default memory allocator is extremely bad and would degrade performance, but it could also be tons of other things. Taking random actions without understanding what the current bottleneck is will never work out well long term.
It does not consume much memory but does lots of allocations/deallocations. No disk operations whatsoever.
M1 has a larger L1 cache, but smaller L3 cache.

It could very well be that your application is hitting a memory pattern that favors larger L1 cache, while the huge L3 cache of EPYC is not useful.

------

If you really wanted to know, you should learn how to use hardware performance counters and check out the instructions-per-clock. If you're around 1 or 2 instructions per clock tick, then you're CPU-bound.

If you're well below that, like 0.1 instructions per clock (i.e., 10 clocks per instruction), then you're cache- and/or RAM-bound.
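As a rough sketch of that rule of thumb in Rust: the two counter values would come from whatever profiler you use, and the names `Bottleneck`, `instructions_per_clock`, and `classify` are invented here. The "1 or more" boundary follows the numbers above; the 0.25 cutoff for the memory-bound side is my own assumption, not a hard rule.

```rust
/// Coarse diagnosis from instructions per clock (IPC). Hypothetical type.
#[derive(Debug, PartialEq)]
enum Bottleneck {
    LikelyCpuBound,
    LikelyMemoryBound,
    Inconclusive,
}

/// Instructions retired divided by core cycles over the same interval.
/// Both values are assumed to come from hardware performance counters.
fn instructions_per_clock(instructions: u64, cycles: u64) -> f64 {
    instructions as f64 / cycles as f64
}

/// Applies the rule of thumb: around 1 IPC or more suggests CPU-bound,
/// far below that suggests cache/RAM-bound.
fn classify(ipc: f64) -> Bottleneck {
    if ipc >= 1.0 {
        Bottleneck::LikelyCpuBound
    } else if ipc <= 0.25 {
        // Assumed cutoff; 0.1 IPC (10 clocks per instruction) falls here.
        Bottleneck::LikelyMemoryBound
    } else {
        Bottleneck::Inconclusive
    }
}
```

For example, 100 instructions over 1,000 cycles is 0.1 IPC and classifies as `LikelyMemoryBound`, while 2,000,000 instructions over 1,000,000 cycles is 2 IPC and classifies as `LikelyCpuBound`. Anything in between deserves a closer look at the cache counters before drawing conclusions.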

-----

From there, you continue your exploration. If you're cache/RAM-bound, you count up L1 cache hits, L2 cache hits, L3 cache hits and cache misses. IIRC, there are some performance counters that even get into the inter-thread communications (but I forget which ones off the top of my head). If you're CPU-bound, check your execution unit utilization instead.

EPYC unfortunately doesn't have very accurate default performance counters, and I'd bet that no one really knows how to use M1 performance counters yet either.

While the default PMC counters of AMD/EPYC are inaccurate (but easy to understand), AMD has a second set of hard-to-understand, but very accurate profiling counters called IBS Profiling: https://www.codeproject.com/Articles/1264851/IBS-Profiling-w...

Still, having that information ought to give you a better idea of "why" your code performs the way it does. You may have to activate IBS-profiling inside of your BIOS before these IBS-profiling tools work.

By default, AMD only has the default performance counters available. So you may have a bit of a struggle juggling the BIOS + profiler to get things working just right, and then you'll absolutely struggle to understand what the hell you're even looking at once all the data is in.

This.

I have dabbled with the AMD & Intel Xeon side of this, but never on macOS. Do you have an idea how one would go about getting performance counters on macOS? IPC, L1 hit/miss, L2 hit/miss, etc.

Unfortunately not. I only have experience on the AMD-side as I played around on my own personal computer.
Thanks, appreciated!
I’d suggest investigating single core performance. If you have the money, buy an i9-12900K (slightly faster single-core than M1 but much hotter) and do some testing on that. If my theory is correct, performance will be even better.
We have examined that as well. Last week we tried an AMD 5950X, which has half as many cores but much better single-core performance. The result was still at 60% of the Epyc performance.
What was the M1 % relative to your Epyc?
Roughly 10% faster
Have you investigated memory constraints?

Ryzen is 2 channels; Epyc is 4-8 (depending on CPU). M1 has that stupidly fast/wide setup.

If your Epyc is one of the 4 channel optimized SKUs or is only running in 4 channel mode, you would get pretty close to the quoted ratios on a memory bandwidth test.
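A back-of-the-envelope way to compare channel counts, sketched in Rust. Theoretical peak bandwidth is channels × transfers per second × bytes per transfer. The function name `peak_bandwidth_gb_s` is made up, and the DDR4-3200 figure with a 64-bit (8-byte) channel in the usage note is an assumption for illustration, not the poster's actual configuration.

```rust
/// Theoretical peak memory bandwidth in GB/s (decimal gigabytes).
/// `mega_transfers_per_s` is the DDR rating, e.g. 3200 for DDR4-3200.
fn peak_bandwidth_gb_s(channels: u32, mega_transfers_per_s: u32, bytes_per_transfer: u32) -> f64 {
    // MT/s × bytes gives MB/s per channel; divide by 1000 for GB/s.
    channels as f64 * mega_transfers_per_s as f64 * bytes_per_transfer as f64 / 1000.0
}

/// Ratio of two configurations' peak bandwidth, e.g. desktop vs. server.
fn bandwidth_ratio(a_gb_s: f64, b_gb_s: f64) -> f64 {
    a_gb_s / b_gb_s
}
```

With the assumed DDR4-3200 on every platform, `peak_bandwidth_gb_s(2, 3200, 8)` is 51.2 GB/s, 4 channels give 102.4 GB/s, and 8 channels give 204.8 GB/s, so a dual-channel machine has half the peak of a 4-channel one. Measured bandwidth is always lower than these peaks, so treat the ratio as a sanity check rather than a prediction.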

Correlation, not causation, but worth looking into.

HN makes us wait for replies… so if we need to continue this further I’m open at muse.theses-0z@icloud.com .

My next question would be whether you ran the 12900K with dual-channel memory.

As others have noted this sounds like a contention issue that you should fix by not allocating in your hot path if at all possible. The easiest fix would probably be to try to switch out your global allocator for something like https://github.com/gnzlbg/jemallocator and see if that doesn't give you a nice performance boost.
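For the "don't allocate in your hot path" part, here is a minimal Rust sketch of the usual trick: hoist the buffer out of the loop and reuse it. Both functions and their workload are hypothetical; they compute the same thing, and the only difference is how often the allocator gets called.

```rust
/// Allocates a fresh Vec on every iteration, so the allocator is hit
/// once per row inside the hot loop.
fn sum_even_squares_allocating(rows: &[Vec<u32>]) -> u64 {
    let mut total = 0u64;
    for row in rows {
        let squares: Vec<u64> = row.iter().map(|&x| (x as u64) * (x as u64)).collect();
        total += squares.iter().filter(|&&s| s % 2 == 0).sum::<u64>();
    }
    total
}

/// Same result, but one scratch buffer is reused across iterations.
/// `clear` drops the contents and keeps the capacity, so the buffer only
/// reallocates when a row is longer than anything seen so far.
fn sum_even_squares_reusing(rows: &[Vec<u32>]) -> u64 {
    let mut total = 0u64;
    let mut scratch: Vec<u64> = Vec::new();
    for row in rows {
        scratch.clear();
        scratch.extend(row.iter().map(|&x| (x as u64) * (x as u64)));
        total += scratch.iter().filter(|&&s| s % 2 == 0).sum::<u64>();
    }
    total
}
```

For `vec![vec![1, 2, 3], vec![4]]` both versions return 20. In multithreaded code each worker would keep its own scratch buffer, which also avoids threads contending on the allocator.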
Hmm, yes we are already using jemallocator actually
It sounds like you might be running into some sort of contention.