by Firadeoclus 1379 days ago
That's a pretty huge amount of processing power hidden away! Are these experimentally confirmed performance numbers?

So this is a 16x16 grid of single-precision fused multiply-adds at 3.2GHz with a throughput of one per clock, right? (16 x 16 x 2 x 3.2e9 = 1.6384e12, counting each multiply-add as 2 FLOPs.) And does using fp16 quadruple throughput again? That would put AMX well above the GPU for fp16 matrix multiplication! (The 8-core GPU does 2.6 TFLOPS for both fp16 and fp32: 128 multiply-adds per core at 1.278GHz.)
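For anyone who wants to check the arithmetic, here is a minimal sketch of both peak figures. It only uses the numbers quoted above; `peak_flops` is a made-up helper name, and the results are theoretical peaks, not measurements.

```python
def peak_flops(fma_lanes: int, clock_hz: float, units: int = 1) -> float:
    """Theoretical peak FLOP/s, counting each fused multiply-add as 2 FLOPs."""
    return units * fma_lanes * 2 * clock_hz

# AMX: a 16x16 grid of fp32 FMAs, one grid operation per clock at 3.2 GHz.
amx_fp32 = peak_flops(fma_lanes=16 * 16, clock_hz=3.2e9)

# GPU: 8 cores, 128 multiply-adds per core per clock at 1.278 GHz.
gpu = peak_flops(fma_lanes=128, clock_hz=1.278e9, units=8)

print(f"AMX fp32 peak: {amx_fp32 / 1e12:.4f} TFLOP/s")  # 1.6384
print(f"GPU peak:      {gpu / 1e12:.4f} TFLOP/s")       # 2.6173
```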

How does it compare in terms of power? Can it sustain 3.2GHz indefinitely, or does it hit power/thermal limits fairly quickly?

1 comment

Correct, yep. These are theoretical numbers, measured in cycles from a P-core (with no loads/stores). Real-world performance tends to be a little lower (~93%): https://twitter.com/stephentyrone/status/1455665595677085697

FP16 only doubles throughput rather than quadrupling.
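As a rough sketch of what doubling implies, assuming the fp32 peak and the GPU figures quoted in the question (all theoretical, and the variable names are just for illustration):

```python
CLOCK_HZ = 3.2e9  # P-core clock assumed in the question

amx_fp32 = 16 * 16 * 2 * CLOCK_HZ  # 16x16 FMAs per clock, 2 FLOPs each
amx_fp16 = 2 * amx_fp32            # fp16 doubles throughput (not 4x)
gpu = 8 * 128 * 2 * 1.278e9        # 8-core GPU figure from the question

ratio = amx_fp16 / gpu
print(f"AMX fp16 peak: {amx_fp16 / 1e12:.4f} TFLOP/s")  # 3.2768
print(f"AMX fp16 / GPU: {ratio:.2f}x")                  # 1.25x
```

So on paper fp16 on AMX would still come out ahead of the GPU, by roughly a quarter rather than by a factor of two or more.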

I haven't looked at power/thermals, so I can't really comment. (Though it's possible it's always running a bit under 3.2GHz, since I was measuring in clock cycles - that might be part of the 7% difference.)
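To put numbers on that, here is a back-of-the-envelope sketch. It assumes the ~93% figure applies to the fp32 peak, and it attributes the whole gap to clock speed, which is only an upper bound on the clock shortfall since other overheads may account for some or all of it.

```python
NOMINAL_CLOCK_HZ = 3.2e9
PEAK_FP32 = 16 * 16 * 2 * NOMINAL_CLOCK_HZ  # theoretical fp32 peak, FLOP/s
EFFICIENCY = 0.93                            # approximate real-world fraction

sustained = EFFICIENCY * PEAK_FP32
# Hypothetical worst case: the entire 7% gap comes from a lower clock.
implied_clock_hz = EFFICIENCY * NOMINAL_CLOCK_HZ

print(f"Sustained fp32: {sustained / 1e12:.3f} TFLOP/s")                    # 1.524
print(f"Clock if the gap were all frequency: {implied_clock_hz / 1e9:.3f} GHz")  # 2.976
```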