| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by logicchains 1150 days ago
	That's unfortunate. Running the 65B Q4 on an AMD Epyc with 32 1.5ghz cores and 256 GB of ram I get around 3 tokens/sec, which is useable if not ideal. I wonder if the difference is related to the RAM or the number of CPUs?

5 comments

lhl 1149 days ago

Although there are multiple bottlenecks, my understanding (and why at a certain point, throwing more threads doesn't work) is that inference for dense LLMs are largely limited by memory bandwidth. Most desktop computers will have dual channel DDR4/DDR5 memory which will be hard pressed to get >60GB/s. A last-gen Epyc/Threadripper Pro should have 8 channel memory DDR4-3200 support, which should get you a theoretical max of 204.8 GB/s (benchmarking ends up more around 150GB/s in AIDA64).

The latest Genoa has 12 channel DDR5-4800 support (and boosted AVX-512) and I'd imagine should perform quite well, but if you primarily want to run inference on a quantized 65B model, I think you're best bang/buck (for local hardware) would be 2 x RTX 3090s (each of those has 24GB of GDDR6X w/ just shy of 1TB/s of memory bandwidth).

link

Ambix 1149 days ago

Yeah, it's really so bad on desktops.

With my LLaMA AVX implementation on 32bit floats [0] there no performance gain after 2 threads, so remaining 14 threads available are of no use, there no memory bandwidth to load them with work :)

[0] https://github.com/gotzmann/llama.go

link

nullc 1147 days ago

To the extent that you're memory bandwidth limited you should be able to do multiple inferences at once --- latency stays high but getting multiple samplings can be extremely useful for many uses and can cover up somewhat for high latency.

link

aljungberg 1145 days ago

To an extent, but memory bandwidth soon becomes a bottleneck there too. The hidden state and the KV cache are large so it becomes a matter of how fast you can move data in and out of your L2 cache. If you don’t have a unified memory pool it gets even worse.

link

logicchains 1149 days ago

Thank you, that makes sense. I had no idea that there was such a dramatic difference in memory bandwidth between desktop and server CPUs.

link

cjbprime 1149 days ago

The two-channel DDR5 in desktops can't even do two channels very well -- if you try to put 64GB RAM in (two dual-rank 32GB DIMMs) then you lose around 50% of the bandwidth compared to a single rank DIMM (e.g. from 8GHz to 4GHz speeds, and increased latency).

link

azeirah 1150 days ago

I'm following the discussions on GitHub as well as their PRs closely.

The primary bottleneck for now is compute.

They've recently made a big improvement to performance by introducing partial gpu acceleration if you compile with a gpu accelerated variant of BLAS. Either cublas (Nvidia) or CLBlast (slightly slower but supports almost everything: Nvidia, Apple, AMD, mobile, raspberry pi etc)

link

tarruda 1150 days ago

3 tokens/sec is a lot faster than what I experienced. Even though your CPU has a lot more cores, I think llama.cpp was not being able to make good use of more than 8 threads.

When did you test this? Maybe llama.cpp had some improvements since I used it (which was at the start of the project).

link

Ambix 1149 days ago

It's not about threads number, it about memory bottleneck. Sweet spot for my M1 Pro laptop is around 6 threads and 4bit model - I've managed to get 20 tokens per sec, really impressive

link

logicchains 1149 days ago

I tested this on the latest master. Llama.cpp has had some performance improvements, although I don't know if that'd be enough to make it 3x faster.

link

mrtranscendence 1149 days ago

That's just a bit faster than my MacBook Pro, for what it's worth. Which was quite expensive but I don't think AMD Epyc expensive ...

link

Ambix 1149 days ago

Is it Zen1 architecture? It should be much better on Zen2 and newer Epycs

link