HN Mirror


by hansvm 426 days ago

Some related fun facts:

1. The roofline idea applies just as well across multiple processes, computers, and data centers. If you have enough "cache" (disk, RAM, whatever), you can do a distributed matmul and keep every coprocessor at nearly 100% efficiency.

2. If you need f32 intermediate precision, you can approximate that with Kahan-like ideas and still take advantage of the f16 core, at somewhere in the 25%-50% efficiency range (still much better than the <10% you get by ignoring the tensor core).
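To put rough numbers on point 1, here is a minimal sketch of the roofline bound for a blocked matmul. It assumes a simplified cost model that is not from the comment: each tile-by-tile block product moves three square tiles (A, B, and C) once over the slow link and performs 2·tile³ flops. The function names and the hardware figures in the demo are made up for illustration.

```python
import math


def tile_intensity(tile: int, bytes_per_elem: int) -> float:
    """Flops per byte for one tile x tile x tile block product (simplified model)."""
    flops = 2 * tile**3                          # one multiply and one add per inner step
    bytes_moved = 3 * tile**2 * bytes_per_elem   # A, B and C tiles, each moved once
    return flops / bytes_moved


def attainable(peak_flops: float, bandwidth: float, intensity: float) -> float:
    """Roofline bound: limited by either compute or (bandwidth * intensity)."""
    return min(peak_flops, bandwidth * intensity)


def min_tile_for_peak(peak_flops: float, bandwidth: float, bytes_per_elem: int) -> int:
    """Smallest tile whose intensity reaches the ridge point peak/bandwidth."""
    # 2*tile / (3*bytes_per_elem) >= peak/bandwidth
    return math.ceil(1.5 * bytes_per_elem * peak_flops / bandwidth)


# Hypothetical node: 100 Tflop/s of f16 compute behind a 10 GB/s network link.
peak, link, f16_bytes = 100e12, 10e9, 2
tile = min_tile_for_peak(peak, link, f16_bytes)
cache_bytes = 3 * tile**2 * f16_bytes            # "cache" needed to hold the three tiles
print(f"tile side: {tile}, cache needed: {cache_bytes / 1e9:.1f} GB")
print(f"bound at that tile: {attainable(peak, link, tile_intensity(tile, f16_bytes)):.3g} flop/s")
```

Under this model the required tile side grows linearly with the compute-to-bandwidth ratio, so a slower link just means a bigger "cache" (RAM or disk instead of SRAM) rather than idle coprocessors.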

1 comment

saagarjha 426 days ago

Yep, the "3" in 3xTF32 kind of gives away the performance cost ;)
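A minimal sketch of the split-precision trick both comments refer to, in the 3xTF32 flavor. Assumptions not in the thread: TF32 is emulated in software by truncating an f32 to 10 explicit mantissa bits (real hardware may round differently), accumulation happens in Python's double precision rather than f32, and the function names are hypothetical. Each operand is split into a "big" part and a "small" residual, and three low-precision products replace one f32 product, which is where the roughly 3x cost comes from.

```python
import random
import struct


def to_f32(x: float) -> float:
    """Round a Python float (f64) to the nearest f32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_tf32(x: float) -> float:
    """Emulate TF32 by clearing the low 13 of f32's 23 mantissa bits (truncation)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # keep sign, 8 exponent bits, top 10 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def split(x: float) -> tuple[float, float]:
    """Split x into a TF32 'big' part and a TF32 'small' residual."""
    big = to_tf32(x)
    small = to_tf32(x - big)
    return big, small


def dot_1x(a, b) -> float:
    """Dot product with every input truncated to TF32 (one product per pair)."""
    return sum(to_tf32(x) * to_tf32(y) for x, y in zip(a, b))


def dot_3x(a, b) -> float:
    """Dot product using three TF32 products per pair; small*small is dropped."""
    total = 0.0
    for x, y in zip(a, b):
        xb, xs = split(x)
        yb, ys = split(y)
        total += xb * yb + xb * ys + xs * yb
    return total


random.seed(0)
a = [to_f32(random.uniform(-1, 1)) for _ in range(256)]
b = [to_f32(random.uniform(-1, 1)) for _ in range(256)]
exact = sum(x * y for x, y in zip(a, b))
print("1x error:", abs(dot_1x(a, b) - exact))
print("3x error:", abs(dot_3x(a, b) - exact))
```

The dropped small-times-small term is on the order of the squared truncation error, which is why three products are enough to get close to f32 accuracy.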
