| HN Mirror

	by p1esk 24 days ago
	A100 FP32 throughput “at its limit”: 19.5 TFLOP/s. AMD EPYC 9965 FP32 throughput “at its limit”: 41.2 TFLOP/s (192 cores x 64 FP32 FLOP/cycle/core x 3.35GHz).
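
The peak figure above is plain arithmetic: cores x FLOP/cycle/core x clock. A minimal sketch of that calculation, taking the per-core FLOP/cycle and clock values from the comment as given (the helper name `peak_tflops` is hypothetical, not from any library):

```python
def peak_tflops(cores: int, flop_per_cycle_per_core: int, clock_ghz: float) -> float:
    """Theoretical peak in TFLOP/s: cores x FLOP/cycle/core x clock."""
    # cores * FLOP/cycle * (GHz * 1e9 cycles/s) gives FLOP/s; divide by 1e12 for TFLOP/s.
    return cores * flop_per_cycle_per_core * clock_ghz * 1e9 / 1e12


# Inputs are the numbers quoted in the comment, not measured values.
epyc_fp32 = peak_tflops(cores=192, flop_per_cycle_per_core=64, clock_ghz=3.35)
a100_fp32 = 19.5  # TFLOP/s, as quoted in the comment

print(f"EPYC 9965 FP32 peak: {epyc_fp32:.1f} TFLOP/s")
print(f"Ratio to A100 FP32:  {epyc_fp32 / a100_fp32:.2f}x")
```

This is a ceiling, not an achievable rate: it assumes every core issues its full vector width every cycle at the stated clock.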

3 comments

tosh 24 days ago

A100: 312 TFLOP/s for FP16

but it is very impressive how far modern CPUs get as well (also in smartphones!)

p1esk 24 days ago

Intel Xeon 6980P: 128 cores x 1024 FP16 FLOP/cycle/core x 3.2 GHz: 419 TFLOP/s

tosh 24 days ago

I'm not saying "GPU more brrt than CPU"

I found the comparison interesting

on Intel Xeon 6980P with 419 TFLOP/s it is still (maybe even more?) interesting to ask:

how much throughput can you reach with Python, Python with lib x, y, z, with C++ like this, with C++ like that etc etc and why?

no?

p1esk 24 days ago

No one in their right mind would use pure Python to do matrix multiplication. It’s like using a screwdriver to hammer nails into wood.

But this discussion is even more bizarre than comparing a screwdriver to a hammer; it’s like comparing a screwdriver to a nail.

zzzoom 24 days ago

EPYC 9965: 614 GB/s of 12-channel DDR5-6400

A100: 1935 GB/s of HBM2e

Most of those FLOPS are constrained by memory bandwidth.
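
One way to make the bandwidth constraint concrete is a roofline-style estimate: a kernel needs at least peak FLOP/s divided by bandwidth FLOP per byte moved to reach peak, and below that it is capped at intensity x bandwidth. A rough sketch using the peak and bandwidth numbers quoted in this thread (the function names are hypothetical, and the 0.5 FLOP/byte example assumes an FP32 matrix-vector product where each 4-byte weight is read once and used for one multiply-add):

```python
def ridge_point(peak_tflops: float, bandwidth_gb_s: float) -> float:
    """FLOP per byte needed before a kernel stops being bandwidth bound."""
    return (peak_tflops * 1e12) / (bandwidth_gb_s * 1e9)


def attainable_tflops(peak_tflops: float, bandwidth_gb_s: float,
                      flop_per_byte: float) -> float:
    """Roofline bound: the lower of compute peak and bandwidth x intensity."""
    bandwidth_bound = flop_per_byte * bandwidth_gb_s * 1e9 / 1e12
    return min(peak_tflops, bandwidth_bound)


# (FP32 peak in TFLOP/s, memory bandwidth in GB/s), as quoted in the thread.
machines = {"EPYC 9965": (41.2, 614.0), "A100": (19.5, 1935.0)}

for name, (peak, bw) in machines.items():
    # Assumed workload: FP32 GEMV, 2 FLOP per 4-byte weight = 0.5 FLOP/byte.
    gemv = attainable_tflops(peak, bw, flop_per_byte=0.5)
    print(f"{name}: ridge {ridge_point(peak, bw):.1f} FLOP/byte, "
          f"FP32 GEMV bound {gemv:.3f} TFLOP/s")
```

Under these assumptions the CPU needs far more reuse per byte than the GPU before its higher FP32 peak becomes reachable.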

Const-me 24 days ago

> Most of those FLOPS are constrained by memory bandwidth

I believe inference with large enough batch size is almost always compute bound, simply due to algorithmic complexity.

Each step of tiled matrix multiplication with square tiles of size N^2 takes O(N^2) memory loads and O(N^3) compute operations. With N = 32 or 64, you will likely saturate compute even on iGPUs with DDR4 or DDR5 memory pretending to be VRAM.
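
The O(N^2) loads versus O(N^3) compute argument can be turned into a FLOP-per-byte figure. A minimal sketch, assuming each tile step loads two N x N input tiles from memory while the output tile stays on chip, and counting a multiply-add as 2 FLOP (the helper name `tile_step_intensity` is hypothetical):

```python
def tile_step_intensity(n: int, bytes_per_element: int) -> float:
    """FLOP per byte loaded for one tile step of a tiled matrix multiply."""
    # Two N x N input tiles are loaded per step (assumption: the accumulator
    # tile lives in registers or on-chip memory and is not re-read).
    bytes_loaded = 2 * n * n * bytes_per_element
    # An N x N by N x N tile product takes N^3 multiply-adds = 2 * N^3 FLOP.
    flops = 2 * n ** 3
    return flops / bytes_loaded


for n in (32, 64):
    for label, size in (("FP32", 4), ("FP16", 2)):
        print(f"N={n} {label}: {tile_step_intensity(n, size):.0f} FLOP/byte")
```

With these assumptions the intensity simplifies to N / bytes_per_element, so doubling the tile edge doubles the compute done per byte fetched.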

zzzoom 24 days ago

Prefill (GEMM) is compute bound, decode (GEMV) is memory bound.

Const-me 23 days ago

> decode (GEMV) is memory bound

Decode with batch size 1 is GEMV. Batching turns decode into GEMM too.
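
A rough sketch of why batching changes the picture: the weight matrix is read once per decode step regardless of batch size, while the FLOP count scales with the batch. The layer shape (4096 x 4096, FP16) and the helper name `decode_intensity` are hypothetical, and the model assumes weights are read once and activations are read and written once per step:

```python
def decode_intensity(batch: int, d_in: int, d_out: int,
                     bytes_per_element: int) -> float:
    """FLOP per byte for one linear layer applied to `batch` token vectors."""
    # One multiply-add (2 FLOP) per weight per sequence in the batch.
    flops = 2 * batch * d_in * d_out
    # Weights are read once; input and output activations scale with batch.
    weight_bytes = d_in * d_out * bytes_per_element
    activation_bytes = batch * (d_in + d_out) * bytes_per_element
    return flops / (weight_bytes + activation_bytes)


# Hypothetical 4096 x 4096 FP16 layer; batch 1 is the GEMV case.
for batch in (1, 8, 64, 512):
    value = decode_intensity(batch, d_in=4096, d_out=4096, bytes_per_element=2)
    print(f"batch {batch:>3}: {value:.1f} FLOP/byte")
```

Under this model the intensity grows almost linearly with batch size while the batch is small relative to the layer dimensions, which is the sense in which batched decode behaves like GEMM rather than GEMV.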

aesthesia 24 days ago

That's also a CPU that came out four years later than the A100. The contemporaneous B200 is not optimized for FP32 and does 74.45 TFLOP/s. For FP16 it's at ~2 PFLOP/s.

p1esk 24 days ago

The point is that modern CPUs are not as slow as most DL people think. Roughly 10x slower but with a lot more memory.
