| HN Mirror



	by lhl 1144 days ago
	Although there are multiple bottlenecks, my understanding (and the reason throwing more threads at it stops helping past a certain point) is that inference for dense LLMs is largely limited by memory bandwidth. Most desktop computers have dual-channel DDR4/DDR5 memory, which will be hard pressed to get >60GB/s. A last-gen Epyc/Threadripper Pro should have 8-channel DDR4-3200 support, which should get you a theoretical max of 204.8 GB/s (benchmarking ends up more around 150GB/s in AIDA64). The latest Genoa has 12-channel DDR5-4800 support (and boosted AVX-512) and I'd imagine it should perform quite well, but if you primarily want to run inference on a quantized 65B model, I think your best bang/buck (for local hardware) would be 2 x RTX 3090s (each of those has 24GB of GDDR6X w/ just shy of 1TB/s of memory bandwidth).
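For anyone who wants to check the arithmetic, here is a rough sketch of where the 204.8 GB/s figure comes from, plus the token-rate ceiling it implies. It assumes each memory channel is 64 bits (8 bytes) wide and that generating one token reads every weight once; the function names and the ~32.5 GB size for a 4-bit 65B model (65e9 params x 0.5 bytes, ignoring quantization overhead) are my own illustrative assumptions, not measurements.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers_per_s: int,
                       bytes_per_transfer: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s.

    Assumes a 64-bit (8-byte) data path per channel; real-world
    throughput is noticeably lower than this ceiling.
    """
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000


def max_tokens_per_s(bandwidth_gbs: float, weight_gb: float) -> float:
    """Upper bound on single-stream decode speed.

    Assumes every weight is streamed from memory once per generated
    token and that nothing else (compute, KV cache) costs any time.
    """
    return bandwidth_gbs / weight_gb


if __name__ == "__main__":
    # Hypothetical weight size: 65e9 params at 4 bits each.
    weights_gb = 65e9 * 0.5 / 1e9

    configs = [
        ("desktop, 2ch DDR4-3200", 2, 3200),
        ("Epyc/TR Pro, 8ch DDR4-3200", 8, 3200),
        ("Genoa, 12ch DDR5-4800", 12, 4800),
    ]
    for name, channels, mts in configs:
        bw = peak_bandwidth_gbs(channels, mts)
        print(f"{name}: {bw:.1f} GB/s peak, "
              f"<= {max_tokens_per_s(bw, weights_gb):.1f} tokens/s")
```

These are ceilings only; the ~150GB/s AIDA64 figure shows how far measured bandwidth can sit below the theoretical number.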

3 comments

Ambix 1143 days ago

Yeah, it's really so bad on desktops.

With my LLaMA AVX implementation on 32-bit floats [0] there is no performance gain after 2 threads, so the remaining 14 threads are of no use: there is no memory bandwidth left to load them with work :)

[0] https://github.com/gotzmann/llama.go


nullc 1142 days ago

To the extent that you're memory bandwidth limited you should be able to do multiple inferences at once --- latency stays high but getting multiple samplings can be extremely useful for many uses and can cover up somewhat for high latency.


aljungberg 1139 days ago

To an extent, but memory bandwidth soon becomes a bottleneck there too. The hidden state and the KV cache are large so it becomes a matter of how fast you can move data in and out of your L2 cache. If you don’t have a unified memory pool it gets even worse.
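A back-of-the-envelope model of this trade-off, as a sketch only: it assumes each decode step streams the weights once (shared by the whole batch) plus one KV cache per sequence, and that memory traffic is the only cost. The function names are made up for illustration; the 80 layers / 8192 hidden size are the published LLaMA-65B shape, while the fp16 cache, 2048-token context, and the 32.5 GB / 204.8 GB/s inputs are assumptions.

```python
def kv_cache_bytes(n_layers: int, hidden_size: int, context_len: int,
                   bytes_per_elem: int = 2) -> int:
    """Size of one sequence's KV cache.

    Two tensors (keys and values) per layer, each of shape
    context_len x hidden_size. Assumes plain multi-head attention.
    """
    return 2 * n_layers * hidden_size * context_len * bytes_per_elem


def batched_tokens_per_s(bandwidth_gbs: float, weight_gb: float,
                         kv_gb_per_seq: float, batch: int) -> float:
    """Aggregate tokens/s across the batch in a purely memory-bound model.

    Per step: weights are read once, every sequence's KV cache is read
    once, and the step yields one token per sequence.
    """
    step_seconds = (weight_gb + batch * kv_gb_per_seq) / bandwidth_gbs
    return batch / step_seconds


if __name__ == "__main__":
    kv_gb = kv_cache_bytes(80, 8192, 2048) / 1e9  # full 2048-token context
    for batch in (1, 2, 4, 8, 16, 32):
        rate = batched_tokens_per_s(204.8, 32.5, kv_gb, batch)
        print(f"batch={batch:2d}: {rate:6.1f} tokens/s total")
```

In this model the total rate keeps rising with batch size but flattens toward bandwidth / KV-cache-size per sequence, which is one way to read "to an extent".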


logicchains 1144 days ago

Thank you, that makes sense. I had no idea that there was such a dramatic difference in memory bandwidth between desktop and server CPUs.


cjbprime 1144 days ago

The two-channel DDR5 in desktops can't even do two channels very well -- if you try to put 64GB RAM in (two dual-rank 32GB DIMMs) then you lose around 50% of the bandwidth compared to a single rank DIMM (e.g. from 8GHz to 4GHz speeds, and increased latency).
