|
|
|
|
|
by hansvm
426 days ago
|
|
Some related fun facts: 1. That roofline curve idea applies to multiple processes, computers, and data centers just as well. If you have enough "cache" (disk, RAM, whatever), you can do a distributed matmul and actually effectively use every coprocessor at nearly 100% efficiency. 2. If you need f32 intermediate precision, you can approximate that with Kahan-like ideas and still take advantage of the f16 core, at somewhere in the 25%-50% efficiency range (still much better than the <10% you get by ignoring the tensor core). |
|