	by pama 590 days ago
	To clarify the title: TFLOP/s is the unit the author goes after, not TFLOP. People in the threads compare CUDA performance on GPUs to WebAssembly performance: please recall that H100 has a theoretical performance of about 1000 TFLOP/s for bfloat16, and even moderately complicated algorithms in typical modern transformer architectures can reach about half of that performance.
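To make the "about half of peak" comparison concrete, here is a minimal sketch using the standard count of 2·M·N·K floating-point operations for a matrix multiply. The matrix size and timing are hypothetical values chosen for illustration, not measurements, and the 1000 TFLOP/s peak is the approximate bfloat16 figure quoted above.

```python
def matmul_tflops(m, n, k, seconds):
    """Achieved TFLOP/s for an (m x k) @ (k x n) matmul that took `seconds`."""
    flop = 2 * m * n * k  # one multiply and one add per inner-product term
    return flop / seconds / 1e12


def utilization(achieved_tflops, peak_tflops=1000.0):
    """Fraction of the theoretical peak actually reached."""
    return achieved_tflops / peak_tflops


# Hypothetical example: an 8192^3 matmul timed at 2.2 ms (assumed, not measured).
achieved = matmul_tflops(8192, 8192, 8192, 2.2e-3)
print(f"{achieved:.0f} TFLOP/s, {utilization(achieved):.0%} of peak")
```

Note the unit: the result is a rate (TFLOP/s), obtained by dividing a FLOP count by elapsed time.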


saagarjha 590 days ago

H100 can do well over 1500 TFLOPS in fp16.


nulltype 590 days ago

Which H100 and how much over 1500 TFLOP/s?

The datasheet for the H100 SXM seems to indicate that it can only do ~1000 TFLOP/s peak.


saagarjha 590 days ago

I just went to Nvidia’s site and downloaded the data sheet: https://resources.nvidia.com/en-us-tensor-core/nvidia-tensor.... It says 1600/1900 in half precision?


wtallis 590 days ago

Read the fine print: "With sparsity". They double the claimed throughput by assuming that half of the FLOPs can be skipped.


menaerus 590 days ago

I also recently went through the specs and noticed "with sparsity", but I didn't quite understand what it specifically refers to. The premise is that a lot of the weights in matmul operations will be zero or insignificant (i.e. the matrices are sparse), and in that case the A100/H100 has circuitry that can boost throughput by up to 2x, essentially "skipping" half of the FLOPs, as you say.

I am not an expert in LLMs, but I don't think you can end up with a significant fraction of zeroed weights (~50%) in a converged network, so I think it is safe to say that the theoretical throughput for 99% of cases is really ~800 TFLOPS and not ~1600 TFLOPS as advertised.
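As a rough sketch of that arithmetic: the starred datasheet figures assume a 2x speedup from sparsity, so halving them gives the dense peak. The 1600 and 1900 values are the approximate numbers quoted earlier in the thread; which H100 variant each belongs to is not stated here, and the function name is just illustrative.

```python
# The "with sparsity" footnote assumes half of the FLOPs can be skipped.
SPARSITY_SPEEDUP = 2.0


def dense_tflops(with_sparsity_tflops, speedup=SPARSITY_SPEEDUP):
    """Convert a starred 'with sparsity' figure into a dense peak figure."""
    return with_sparsity_tflops / speedup


# Approximate half-precision figures as quoted in the thread.
for advertised in (1600, 1900):
    dense = dense_tflops(advertised)
    print(f"{advertised} TFLOP/s with sparsity -> {dense:.0f} TFLOP/s dense")
```

The second result, roughly 950 TFLOP/s, lines up with the ~1000 TFLOP/s peak mentioned above for the H100 SXM.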


saagarjha 590 days ago

Oh, that is really annoying. Thanks for catching that!


pama 590 days ago

There are two populations of people reading the NVIDIA specs (and now you've switched groups). If NVIDIA ever changes their marketing strategy and the asterisk comes to denote something else, there might be a third population, because I know a lot of people who I suspect will keep dividing those starred FLOP/s by two :-)
