


	by dougall 1386 days ago
	On M1, for single-precision, one AMX P-unit is ~1.64 TFLOPS and one P-core is ~102 GFLOPS, so ~16x core-for-core. But you have four P-cores for every AMX P-unit, so more like 4x. And for double-precision that shrinks to 2x (~410 GFLOPS for the AMX P-unit vs. ~51 GFLOPS per P-core). (This is a simplification that doesn't include the E-cores, nor the AMX E-unit, but their contribution isn't huge. I suspect AMX throughput may have doubled on M2, but I haven't verified that.)
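For anyone who wants to check the ratios, here is a minimal sketch of the arithmetic in Python. It only uses the approximate figures quoted in the comment above; the names (`AMX_P_UNIT_GFLOPS`, `amx_advantage`, and so on) are made up for illustration and nothing here is measured.

```python
# Approximate M1 figures quoted in the comment above (not measured here).
AMX_P_UNIT_GFLOPS = {"fp32": 1640.0, "fp64": 410.0}
P_CORE_GFLOPS = {"fp32": 102.0, "fp64": 51.0}
P_CORES_PER_AMX_UNIT = 4  # four P-cores share one AMX P-unit


def amx_advantage(precision):
    """Return (core-for-core ratio, ratio against all four P-cores)."""
    per_core = AMX_P_UNIT_GFLOPS[precision] / P_CORE_GFLOPS[precision]
    return per_core, per_core / P_CORES_PER_AMX_UNIT


for precision in ("fp32", "fp64"):
    per_core, per_cluster = amx_advantage(precision)
    print(f"{precision}: ~{per_core:.0f}x core-for-core, "
          f"~{per_cluster:.0f}x vs four P-cores")
```

Like the comment itself, this ignores the E-cores and the AMX E-unit.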


Firadeoclus 1386 days ago

That's a pretty huge amount of processing power hidden away! Are these experimentally confirmed performance numbers?

So this is a 16x16 grid of single-precision fused multiply-adds at 3.2GHz with a throughput of one per clock, right? (16 x 16 x 2 x 3.2e9 = 1.6384e12) And does using fp16 quadruple throughput again? That would put AMX well above the GPU for fp16 matrix multiplication! (2.6 TFLOPS fp16/fp32 for the 8-core GPU: 128 multiply-adds per core at 1.278GHz)
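The two peak figures above come from the same formula (FMA lanes x 2 ops per FMA x clock x number of units). A small Python sketch, with `peak_flops` as a hypothetical helper name and all inputs taken from the numbers in this comment:

```python
def peak_flops(fma_lanes, clock_hz, units=1):
    """Theoretical peak FLOPS, counting each fused multiply-add as 2 ops."""
    return fma_lanes * 2 * clock_hz * units


# One AMX P-unit: 16 x 16 FMAs per clock at 3.2 GHz.
amx_fp32 = peak_flops(16 * 16, 3.2e9)

# 8-core GPU: 128 FMAs per core per clock at 1.278 GHz.
gpu_fp32 = peak_flops(128, 1.278e9, units=8)

print(f"AMX P-unit: {amx_fp32 / 1e12:.2f} TFLOPS")
print(f"8-core GPU: {gpu_fp32 / 1e12:.2f} TFLOPS")
```

These are theoretical peaks only; they say nothing about memory bandwidth or sustained clocks.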

How does it compare in terms of power? Can it sustain 3.2GHz indefinitely, or does it hit power/thermal limits fairly quickly?

dougall 1386 days ago

Correct, yep. These are theoretical numbers, measured in cycles from a P-core (with no loads/stores); real-world performance tends to be a little less (~93%): https://twitter.com/stephentyrone/status/1455665595677085697

FP16 only doubles throughput rather than quadrupling.

I haven't looked at power/thermals, so I can't really comment. (Though it's possible it's always running a bit under 3.2GHz, since I was measuring in clock cycles - that might be part of the 7% difference.)
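To put rough numbers on the replies above, here is a back-of-the-envelope Python sketch. It assumes the ~93% figure applies uniformly and, for the clock estimate, that the whole 7% gap comes from the clock running below 3.2GHz. The reply only says that might be part of it, so treat the clock figure as a lower bound under that assumption, not a measurement. All names are illustrative.

```python
PEAK_FP32_TFLOPS = 1.6384   # 16 x 16 x 2 x 3.2e9, from the thread
REAL_WORLD_FRACTION = 0.93  # "~93%" quoted in the reply
NOMINAL_CLOCK_GHZ = 3.2
FP16_SPEEDUP = 2            # fp16 doubles throughput (not quadruples)


def estimates():
    """Return rough derived figures as a dict (all values approximate)."""
    return {
        # Expected real-world fp32 throughput of one AMX P-unit.
        "real_world_fp32_tflops": PEAK_FP32_TFLOPS * REAL_WORLD_FRACTION,
        # Theoretical fp16 peak given the 2x figure.
        "peak_fp16_tflops": PEAK_FP32_TFLOPS * FP16_SPEEDUP,
        # Effective clock IF the entire 7% gap were due to clock speed.
        "min_effective_clock_ghz": NOMINAL_CLOCK_GHZ * REAL_WORLD_FRACTION,
    }


for name, value in estimates().items():
    print(f"{name}: ~{value:.2f}")
```

With fp16 at 2x, the theoretical peak works out to about 3.28 TFLOPS, which would still be above the ~2.6 TFLOPS quoted for the 8-core GPU.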