by dougall
1386 days ago
On M1, for single-precision, one AMX P-unit is ~1.64 TFLOPS and one P-core is ~102 GFLOPS, so ~16x core-for-core. But you have four P-cores for every AMX P-unit, so it's more like 4x. For double-precision that shrinks to 2x (~410 GFLOPS per AMX P-unit vs. ~51 GFLOPS per P-core, or ~204 GFLOPS across four P-cores). (This is a simplification that doesn't include the E-cores or the AMX E-unit, but their contribution isn't huge. I suspect AMX throughput may have doubled on M2, but I haven't verified that.)
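For anyone who wants to check the ratios, here is a rough sketch of the arithmetic in Python. The throughput numbers are just the approximate figures quoted above; the constant and function names are made up for illustration, and E-cores and the AMX E-unit are ignored as in the comment.

```python
# Approximate M1 figures quoted in the comment above (GFLOPS).
AMX_FP32_GFLOPS = 1640.0   # one AMX P-unit, single precision
CORE_FP32_GFLOPS = 102.0   # one P-core, single precision
AMX_FP64_GFLOPS = 410.0    # one AMX P-unit, double precision
CORE_FP64_GFLOPS = 51.0    # one P-core, double precision
P_CORES_PER_AMX_UNIT = 4   # four P-cores share one AMX P-unit


def speedups(amx_gflops, core_gflops, cores=P_CORES_PER_AMX_UNIT):
    """Return (AMX vs. one core, AMX vs. all cores sharing the unit)."""
    per_core = amx_gflops / core_gflops
    per_cluster = amx_gflops / (core_gflops * cores)
    return per_core, per_cluster


fp32_core, fp32_cluster = speedups(AMX_FP32_GFLOPS, CORE_FP32_GFLOPS)
fp64_core, fp64_cluster = speedups(AMX_FP64_GFLOPS, CORE_FP64_GFLOPS)

print(f"fp32: {fp32_core:.1f}x per core, {fp32_cluster:.1f}x per 4-core cluster")
print(f"fp64: {fp64_core:.1f}x per core, {fp64_cluster:.1f}x per 4-core cluster")
```

The "2x" for double precision is the cluster-level ratio; core-for-core it is still about 8x.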
So this is 16x16 single-precision fused multiply-adds at 3.2 GHz with a throughput of one per clock, right? (16 x 16 x 2 x 3.2e9 = 1.6384e12) And does using fp16 quadruple throughput again? That would put AMX well above the GPU for fp16 matrix multiplication! (2.6 TFLOPS fp16/fp32 for the 8-core GPU: 128 multiply-adds per core at 1.278 GHz)
How does it compare in terms of power? Can it sustain 3.2 GHz indefinitely, or does it hit power/thermal limits fairly quickly?
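The peak numbers in this reply can be reproduced with a small sketch like the one below. It assumes the usual convention of counting one fused multiply-add as 2 FLOPs. The helper name `peak_flops` is hypothetical, and the fp16 line only works out what the "quadruple" guess would imply; it is not a measured or confirmed figure.

```python
def peak_flops(fma_lanes, clock_hz, units=1):
    """Peak FLOP/s assuming each lane retires one FMA (2 FLOPs) per clock."""
    return fma_lanes * 2 * clock_hz * units


# AMX P-unit: 16x16 fp32 FMAs per clock at 3.2 GHz (figures from the reply).
amx_fp32 = peak_flops(16 * 16, 3.2e9)

# 8-core GPU: 128 FMAs per core per clock at 1.278 GHz (figures from the reply).
gpu = peak_flops(128, 1.278e9, units=8)

# Hypothetical only: what 4x fp32 throughput would mean for fp16 on AMX.
amx_fp16_if_quadrupled = 4 * amx_fp32

print(f"AMX fp32:             {amx_fp32 / 1e12:.4f} TFLOPS")
print(f"GPU fp16/fp32:        {gpu / 1e12:.4f} TFLOPS")
print(f"AMX fp16 (if 4x):     {amx_fp16_if_quadrupled / 1e12:.4f} TFLOPS")
```

These are theoretical peaks; they say nothing about whether either unit can sustain its clock under power or thermal limits.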