by Firadeoclus
1379 days ago
That's a pretty huge amount of processing power hidden away! Are these experimentally confirmed performance numbers? So this is 16x16 single-precision fused multiply-adds at 3.2GHz with a throughput of one per clock, right? (16 x 16 x 2 x 3.2e9 = 1.6384e12) And does using fp16 quadruple throughput again? That would put AMX well above the GPU for fp16 matrix multiplication! (2.6 TFLOPS fp16/fp32 for the 8-core GPU: 128 multiply-adds per core at 1.278GHz) How does it compare in terms of power? Can it sustain 3.2GHz indefinitely, or does it hit power/thermal limits fairly quickly?
|
FP16 only doubles throughput rather than quadrupling.
I haven't looked at power/thermals, so I can't really comment. (Though it's possible it's always running a bit under 3.2GHz, since I was measuring in clock cycles - that might be part of the 7% difference.)
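
For reference, the back-of-the-envelope arithmetic in this thread can be reproduced with a rough sketch like the one below. It only restates the figures quoted above (a 16x16 FMA grid at 3.2GHz, FP16 doubling throughput, 8 GPU cores with 128 multiply-adds each at 1.278GHz). The helper name `peak_flops` and the `scale` parameter are made up for illustration, and the results are theoretical peaks under those assumptions, not measurements.

```python
def peak_flops(fma_units: int, clock_hz: float, scale: float = 1.0) -> float:
    """Theoretical peak FLOP/s: each fused multiply-add counts as 2 FLOPs,
    assuming every unit issues one FMA per clock (an idealized assumption)."""
    return fma_units * 2 * clock_hz * scale


# AMX, FP32: 16 x 16 grid of FMAs, one full grid per clock at 3.2 GHz.
amx_fp32 = peak_flops(16 * 16, 3.2e9)

# AMX, FP16: the reply above says throughput doubles (not quadruples).
amx_fp16 = peak_flops(16 * 16, 3.2e9, scale=2.0)

# GPU: 8 cores x 128 multiply-adds per core at 1.278 GHz.
gpu = peak_flops(8 * 128, 1.278e9)

for name, value in [("AMX fp32", amx_fp32), ("AMX fp16", amx_fp16), ("GPU", gpu)]:
    print(f"{name}: {value / 1e12:.2f} TFLOPS")
```

If the unit is actually clocked a bit under 3.2GHz, as suggested above, the AMX figures scale down linearly with the clock.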