by hansvm 426 days ago
Some related fun facts:

1. That roofline curve idea applies to multiple processes, computers, and data centers just as well. If you have enough "cache" (disk, RAM, whatever) at each level, you can do a distributed matmul and keep every coprocessor at nearly 100% efficiency.

2. If you need f32 intermediate precision, you can approximate that with Kahan-like ideas and still take advantage of the f16 core, at somewhere in the 25%-50% efficiency range (still much better than the <10% you get by ignoring the tensor core).
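
To make point 2 concrete, here is a minimal pure-Python sketch of one Kahan-like scheme, not necessarily the exact one meant above. Each f32 input is split into an f16 high part plus an f16 residual, and three f16 products per element recover most of the precision lost by rounding to f16. The helper names (`to_f16`, `to_f32`, `split`, `dot_f16`, `dot_split`) are made up for illustration, Python's f64 floats stand in for the f32 accumulator of a tensor core, and a real kernel would likely also scale the residual to keep it out of the f16 subnormal range.

```python
import math
import random
import struct


def to_f16(x):
    """Round x to the nearest IEEE 754 half-precision (binary16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_f32(x):
    """Round x to the nearest IEEE 754 single-precision (binary32) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def split(x):
    """Split x into an f16 high part and an f16 residual, so x ~= hi + lo."""
    hi = to_f16(x)
    lo = to_f16(x - hi)  # what the first rounding threw away
    return hi, lo


def dot_f16(a, b):
    """Baseline: round each input to f16 once, multiply, accumulate wide."""
    return math.fsum(to_f16(x) * to_f16(y) for x, y in zip(a, b))


def dot_split(a, b):
    """Three f16 products per element; the tiny lo*lo term is dropped."""
    terms = []
    for x, y in zip(a, b):
        xh, xl = split(x)
        yh, yl = split(y)
        terms.append(xh * yh)            # main product
        terms.append(xh * yl + xl * yh)  # two correction products
    return math.fsum(terms)


if __name__ == "__main__":
    rng = random.Random(0)
    a = [to_f32(rng.uniform(-1.0, 1.0)) for _ in range(256)]
    b = [to_f32(rng.uniform(-1.0, 1.0)) for _ in range(256)]
    exact = math.fsum(x * y for x, y in zip(a, b))
    print("plain f16 error:", abs(dot_f16(a, b) - exact))
    print("split f16 error:", abs(dot_split(a, b) - exact))
```

The three products per element are where the efficiency hit comes from: you do roughly 3x the tensor-core work to claw back the extra mantissa bits.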

1 comment

Yep, the "3" in 3xTF32 kind of gives away the performance cost ;)