by pama
590 days ago
To clarify the title: the unit the author is after is TFLOP/s (a rate), not TFLOP (a count of operations). People in the threads are comparing CUDA performance on GPUs to WebAssembly performance. For reference, an H100 has a theoretical peak of about 1000 TFLOP/s for bfloat16, and even moderately complicated algorithms in typical modern transformer architectures can reach about half of that.
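To make the TFLOP vs. TFLOP/s distinction concrete, here is a minimal Python sketch. The function names, the matmul example, and the round 1000 TFLOP/s peak are assumptions for illustration; the peak is the approximate figure quoted above, not a measured value.

```python
# Assumed round figure for H100 bfloat16 peak, taken from the comment above.
PEAK_TFLOP_PER_S = 1000.0


def matmul_flop(m: int, n: int, k: int) -> int:
    """FLOP count for an (m x k) @ (k x n) matmul: one multiply and one add per term."""
    return 2 * m * n * k


def tflop_per_s(flop: float, seconds: float) -> float:
    """Convert an amount of work (FLOP) and a wall time into a rate (TFLOP/s)."""
    return flop / seconds / 1e12


def utilization(achieved_tflop_per_s: float, peak: float = PEAK_TFLOP_PER_S) -> float:
    """Fraction of the theoretical peak that was actually achieved."""
    return achieved_tflop_per_s / peak


if __name__ == "__main__":
    work = matmul_flop(8192, 8192, 8192)   # amount of work, in FLOP
    elapsed = 2.2e-3                       # hypothetical timing, in seconds
    rate = tflop_per_s(work, elapsed)      # throughput, in TFLOP/s
    print(f"{work / 1e12:.2f} TFLOP in {elapsed * 1e3:.1f} ms "
          f"= {rate:.0f} TFLOP/s ({utilization(rate):.0%} of assumed peak)")
```

The same amount of work (TFLOP) gives very different TFLOP/s numbers depending on how long it takes, which is why the rate is the figure to compare across CUDA and WebAssembly.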