by saboot 1386 days ago
Is there a comparison with other fast cpu methods of matrix multiplication?
4 comments

On M1, for single-precision, one AMX P-unit is ~1.64 TFLOPS and one P-core is ~102 GFLOPS. So ~16x core-for-core. But you have four P-cores for every AMX P-unit, so more like 4x. And for double-precision that shrinks to 2x (~410 GFLOPS vs. ~51 GFLOPS, i.e. ~8x core-for-core).
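To make the ratios above easy to re-check, here is a small sketch that just redoes the arithmetic. The throughput figures are the approximate ones quoted in this comment; the dictionary and function names are made up for illustration.

```python
# Approximate M1 throughput figures quoted in the comment above, in GFLOPS.
AMX_P_UNIT_GFLOPS = {"fp32": 1640.0, "fp64": 410.0}
P_CORE_GFLOPS = {"fp32": 102.0, "fp64": 51.0}
P_CORES_PER_AMX_UNIT = 4  # four P-cores share one AMX P-unit


def amx_advantage(precision):
    """Return (core-for-core ratio, ratio against all four P-cores)."""
    per_core = AMX_P_UNIT_GFLOPS[precision] / P_CORE_GFLOPS[precision]
    per_cluster = per_core / P_CORES_PER_AMX_UNIT
    return per_core, per_cluster


for precision in ("fp32", "fp64"):
    per_core, per_cluster = amx_advantage(precision)
    print(f"{precision}: ~{per_core:.0f}x per core, ~{per_cluster:.0f}x per cluster")
```

As in the comment, this ignores the E-cores and the AMX E-unit.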

(This is a simplification that doesn't include the E-cores, nor the AMX E-unit, but their contribution isn't huge. I suspect AMX throughput may have doubled on M2, but I haven't verified that.)

That's a pretty huge amount of processing power hidden away! Are these experimentally confirmed performance numbers?

So this is 16x16 single precision fused multiply-adds at 3.2GHz with a throughput of one per clock, right? (16 x 16 x 2 x 3.2e9 = 1.6384e12) And does using fp16 quadruple throughput again? That would put AMX well above the GPU for fp16 matrix multiplication! (2.6 TFLOPs fp16/fp32 for 8-core GPU, 128 multiply-add per core at 1.278GHz)
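The back-of-the-envelope numbers in this comment can be reproduced with a few lines of Python. This is only a sketch of the peak-throughput formula (FMA lanes x 2 x clock x units); the helper name `peak_flops` is invented here, and the lane counts and clocks are the ones stated in the comment.

```python
def peak_flops(fma_lanes, clock_hz, units=1):
    """Theoretical peak FLOP/s; each fused multiply-add counts as 2 operations."""
    return fma_lanes * 2 * clock_hz * units


# One AMX P-unit: a 16x16 grid of fp32 FMAs, one per clock at 3.2 GHz.
amx_fp32 = peak_flops(16 * 16, 3.2e9)

# 8-core GPU: 128 multiply-adds per core per clock at 1.278 GHz.
gpu = peak_flops(128, 1.278e9, units=8)

print(f"AMX P-unit fp32: {amx_fp32 / 1e12:.4f} TFLOPS")
print(f"8-core GPU:      {gpu / 1e12:.4f} TFLOPS")
```

These are ceilings assuming one operation issued every clock; they say nothing about what is sustained under load.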

How does it compare in terms of power? Can it sustain 3.2GHz indefinitely, or does it hit power/thermal limits fairly quickly?

Correct, yep. These are theoretical numbers, measured in cycles from a P-core (with no loads/stores). Real-world performance tends to be a little less (~93%): https://twitter.com/stephentyrone/status/1455665595677085697

FP16 only doubles throughput rather than quadrupling.

I haven't looked at power/thermals, so I can't really comment. (Though it's possible it's always running a bit under 3.2GHz, since I was measuring in clock cycles - that might be part of the 7% difference.)
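For a rough sense of scale, the ~93% figure can be turned into numbers. This is a sketch, not a measurement: it applies 93% to the theoretical fp32 peak, and then assumes the entire shortfall comes from clock speed, which is the extreme case (the comment only says clock might be part of it). The constant names are invented for illustration.

```python
NOMINAL_CLOCK_HZ = 3.2e9
THEORETICAL_FP32_FLOPS = 16 * 16 * 2 * NOMINAL_CLOCK_HZ  # one AMX P-unit
MEASURED_FRACTION = 0.93  # "~93%" real-world vs. theoretical

# Sustained throughput if real-world runs reach ~93% of the theoretical peak.
sustained_flops = THEORETICAL_FP32_FLOPS * MEASURED_FRACTION

# ASSUMPTION: attribute the whole 7% gap to a lower clock. This gives a
# lower bound on the effective clock, not a measured frequency.
implied_clock_hz = NOMINAL_CLOCK_HZ * MEASURED_FRACTION

print(f"sustained fp32: ~{sustained_flops / 1e12:.2f} TFLOPS")
print(f"implied clock (lower bound): ~{implied_clock_hz / 1e9:.2f} GHz")
```

If some of the gap is load/store overhead instead, the real clock would sit somewhere between this bound and 3.2 GHz.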

IIRC I measured something like a 10-50% performance difference (I don't remember exactly, but it was somewhere in there) vs. a reasonably well-regarded BLAS implementation. This was for dgemm specifically; I don't know if the story changes for smaller floats.

Not too bad. I do scientific computing and choose Intel/Nvidia because their APIs for accelerated math operations are documented and supported for developers.

I've been paying attention to what Apple has been pushing with their M1/M2 chips, and I'm pretty tempted to try it out, but unless these features are documented and supported I can't feel comfortable writing programs relying on them.

The API Apple wants you to use is documented and presumably is here to stay.

That doesn't help if there's some edge case where you'd need access to the raw ISA, but still.

Also, BLAS is part of that interface (https://developer.apple.com/documentation/accelerate/blas)

Of course it is a black box in that you can’t (realistically) try and speed it up. You still run the risk of Apple’s priorities being different from yours.

OTOH the interfaces are standard. Using Accelerate instead of the standard BLAS is just a compiler switch.
They are not supported directly, but you can use them through the Accelerate framework, which has optimised BLAS and FFT implementations (amongst many other things).

SGEMM / DGEMM using AMX2 (the first M1 has AMX2; the A14 has AMX1) is approximately 100% faster than the same operation running on NEON, which is already a specialized vector math system.