by sliken 1517 days ago
I saw this quantified (I think at AnandTech): something like 220GB/sec out of 400GB/sec on the M1 Max, so a bit north of 50%.

Also keep in mind that normal x86-64 chips, even without an IGP, only get about 60-65% of peak, even with nothing else sharing the memory bus. I often see this quantified with McCalpin's STREAM benchmark.

So the M1 Ultra likely has a pretty impressive memory bandwidth of around 440GB/sec, which isn't a large fraction of 800GB/sec, but it is still more than any other desktop or server chip I know of. The AMD Epyc maxes out at 8 channels of DDR4-3200, which is in the neighborhood of 208GB/sec peak, with an observed bandwidth of 110-120GB/sec.
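
To make the ratios concrete, here is a rough sketch of the arithmetic using only the figures quoted in this comment. The helper name `fraction_of_peak` is made up for illustration, and the M1 Ultra row is the extrapolation above, not a measurement.

```python
def fraction_of_peak(observed_gbs: float, peak_gbs: float) -> float:
    """Return observed bandwidth as a fraction of theoretical peak."""
    if peak_gbs <= 0:
        raise ValueError("peak_gbs must be positive")
    return observed_gbs / peak_gbs


# (observed, peak) in GB/sec, exactly as quoted in the comment.
# The M1 Ultra entry is an estimate (2x the M1 Max), not a measured value.
quoted = {
    "M1 Max": (220, 400),
    "M1 Ultra (estimated)": (440, 800),
    "Epyc, low end": (110, 208),
    "Epyc, high end": (120, 208),
}

for name, (observed, peak) in quoted.items():
    print(f"{name}: {fraction_of_peak(observed, peak):.0%} of peak")
```

On these numbers both Apple parts land at 55%, and the Epyc range works out to roughly 53-58%.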

2 comments

> I saw this quantified (I think at anandtech)

Correct. The numbers we have are from their M1 Max deep dive, with the M1 Ultra being two M1 Max chips fused together.

For CPU cores:

>Adding a third thread there’s a bit of an imbalance across the clusters, DRAM bandwidth goes to 204GB/s, but a fourth thread lands us at 224GB/s and this appears to be the limit on the SoC fabric that the CPUs are able to achieve, as adding additional cores and threads beyond this point does not increase the bandwidth to DRAM at all. It’s only when the E-cores, which are in their own cluster, are added in, when the bandwidth is able to jump up again, to a maximum of 243GB/s.

https://www.anandtech.com/show/17024/apple-m1-max-performanc...

GPU cores:

>I haven’t seen the GPU use more than 90GB/s (measured via system performance counters). While I’m sure there’s some productivity workload out there where the GPU is able to stretch its legs, we haven’t been able to identify them yet.

Other:

>That leaves everything else which is on the SoC, media engine, NPU, and just workloads that would simply stress all parts of the chip at the same time. The new media engine on the M1 Pro and Max are now able to decode and encode ProRes RAW formats, the above clip is a 5K 12bit sample with a bitrate of 1.59Gbps, and the M1 Max is not only able to play it back in real-time, it’s able to do it at multiple times the speed, with seamless immediate seeking. Doing the same thing on my 5900X machine results in single-digit frames. The SoC DRAM bandwidth while seeking around was at around 40-50GB/s – I imagine that workloads that stress CPU, GPU, media engines all at the same time would be able to take advantage of the full system memory bandwidth, and allow the M1 Max to stretch its legs and differentiate itself more from the M1 Pro and other systems.

> of around 440GB/sec, which isn't a large fraction of 800GB/sec

Where's the 800GB/s from?

Peak = never observed, but calculated from clock speed * bus width. Much like the speed of light, you'll never see it.
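
As a sketch of that calculation: peak is transfers per second times bytes per transfer. The memory configurations below are my assumptions and are not stated in this thread (LPDDR5-6400 on a 512-bit interface for the M1 Max, double that width for the Ultra, and 8 x 64-bit channels of DDR4-3200 for Epyc); `peak_bandwidth_gbs` is just an illustrative name.

```python
def peak_bandwidth_gbs(transfers_per_sec: float, bus_width_bits: int) -> float:
    """Theoretical peak in GB/sec: transfer rate times bytes moved per transfer."""
    bytes_per_transfer = bus_width_bits / 8
    return transfers_per_sec * bytes_per_transfer / 1e9


# Assumed configurations (not given in the thread):
m1_max = peak_bandwidth_gbs(6400e6, 512)     # LPDDR5-6400, 512-bit interface
m1_ultra = peak_bandwidth_gbs(6400e6, 1024)  # two M1 Max dies, so twice the width
epyc = peak_bandwidth_gbs(3200e6, 8 * 64)    # 8 channels of DDR4-3200, 64 bits each

print(f"M1 Max:   {m1_max:.1f} GB/sec")
print(f"M1 Ultra: {m1_ultra:.1f} GB/sec")
print(f"Epyc:     {epyc:.1f} GB/sec")
```

Under those assumptions the results are 409.6, 819.2 and 204.8 GB/sec, which is where the rounded 400 and 800 marketing figures would come from.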

That number for the M1 Ultra (from the OP's post) = 800GB/sec. McCalpin's STREAM benchmark is often cited as a practical/useful number for usable bandwidth, using a straightforward implementation in C or Fortran without trying to play games, much like the vast majority of codes out there.
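
For anyone who hasn't looked at it, the triad kernel in STREAM is just `a[i] = b[i] + q*c[i]`, and the benchmark counts 24 bytes per iteration for 8-byte doubles (two reads, one write). Below is a simplified sketch of that accounting. It is in pure Python only to show the shape, so interpreter overhead dominates and the number it reports is nowhere near real memory bandwidth; the real benchmark is C or Fortran, uses arrays much larger than cache, and reports the best of several runs. The function names are illustrative.

```python
import time
from array import array


def triad_bytes(n: int, word_bytes: int = 8) -> int:
    """Bytes STREAM counts for triad: read b[i], read c[i], write a[i]."""
    return 3 * word_bytes * n


def run_triad(n: int, scalar: float = 3.0) -> float:
    """Run a[i] = b[i] + scalar * c[i] once and return the counted GB/sec."""
    a = array("d", [0.0]) * n
    b = array("d", [2.0]) * n
    c = array("d", [1.0]) * n

    start = time.perf_counter()
    for i in range(n):
        a[i] = b[i] + scalar * c[i]
    # Guard against a zero reading on very coarse timers.
    elapsed = max(time.perf_counter() - start, 1e-9)

    # Sanity check the result, as the real benchmark also validates its arrays.
    assert a[0] == 5.0 and a[n - 1] == 5.0
    return triad_bytes(n) / elapsed / 1e9


if __name__ == "__main__":
    print(f"triad (pure Python, not representative): {run_triad(1_000_000):.3f} GB/sec")
```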

Also note that x86-64 chips use a strict memory model, which results in a lower fraction of observed bandwidth vs peak. Arm has a looser memory model, which achieves a higher fraction of peak.