by buildbot 473 days ago
Ah, I didn’t realize they’d upped the memory bandwidth to DDR5-6000 (vs 4800), thanks for the correction!

I don't believe the memory bandwidth doubles. See this issue for a graph with single- and dual-socket measurements; there is essentially no difference: https://github.com/abetlen/llama-cpp-python/issues/1098

Perhaps this is incorrect now, but I also know with 2x 4090s you don’t get higher tokens per second than 1x 4090 with llama.cpp, just more memory capacity.

(All of this applies only to llama.cpp; I have no experience with other software or with how memory bandwidth may scale across sockets.)


The memory bandwidth does double, but to exploit it the program must be written and run with careful, NUMA-aware memory placement, so that each core mostly accesses memory attached to the closest memory controller and not memory attached to the other socket.

With a badly organized program, performance can be limited not by the memory bandwidth, which is always exactly double on a dual-socket system, but by the transfers over the inter-socket links.
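To make that argument concrete, here is a toy model in Python, not a measurement. It assumes two identical sockets, perfectly symmetric traffic, and a single inter-socket link figure per direction. The function name and all numbers are hypothetical placeholders, not specs of any real CPU.

```python
def dual_socket_bandwidth(mem_bw, link_bw, local_fraction):
    """Aggregate GB/s for two symmetric sockets in a toy NUMA model.

    mem_bw:         peak bandwidth of one socket's memory controllers (GB/s)
    link_bw:        inter-socket link bandwidth per direction (GB/s)
    local_fraction: share of each socket's accesses served by its own memory
    """
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    remote_fraction = 1.0 - local_fraction
    # With symmetric traffic each memory controller still serves one
    # socket's worth of requests, so it caps the per-socket rate at mem_bw.
    per_socket = mem_bw
    if remote_fraction > 0.0:
        # The remote share has to cross the link, which adds a second cap.
        per_socket = min(mem_bw, link_bw / remote_fraction)
    return 2.0 * per_socket


# Hypothetical numbers, chosen only to illustrate the shape of the effect.
print(dual_socket_bandwidth(400.0, 100.0, 1.0))  # all accesses local
print(dual_socket_bandwidth(400.0, 100.0, 0.5))  # pages spread blindly
```

With these placeholder numbers, fully local placement gives twice the single-socket figure, while a 50/50 split lands back at the single-socket figure, which would look like "no difference" in a benchmark.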

Moreover, your link is about older Intel Xeon Sapphire Rapids CPUs, which have inferior memory interfaces and more quirks in memory optimization.

Yes, I believe in theory a correctly written program could scale across sockets, depending on the problem at hand.

But where is your data? For llama.cpp, on whatever dual-socket CPU system you want. That’s all I am claiming.

Googling for what you asked about immediately turned up this discussion:

https://github.com/ggml-org/llama.cpp/discussions/11733

about the scaling of llama.cpp and DeepSeek on some dual-socket AMD systems.

It was rather tricky, but after many experiments they obtained almost double the speed on two sockets, especially on AMD Turin.

However, if you look at the actual benchmark data, the results must be much lower than what is really possible. Their AMD Turin test system (named P1 there) had only two thirds of its memory channels populated, so performance limited by memory bandwidth could be increased by 50%, and it had 16-core CPUs, so performance limited by computation could be increased around 10 times.
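The headroom arithmetic can be checked with a few lines of Python. This is only a sketch: the helper names are made up, and comparing against a 192-core part (the core count mentioned elsewhere in this thread) is my assumption about what "around 10 times" refers to. It also assumes ideal scaling with core count.

```python
from fractions import Fraction


def bandwidth_headroom(populated_fraction):
    """Relative gain in peak bandwidth from populating every memory channel."""
    return 1 / Fraction(populated_fraction) - 1


def compute_headroom(cores_tested, cores_max):
    """Ratio of available cores to tested cores (assumes ideal scaling)."""
    return Fraction(cores_max, cores_tested)


# Two thirds of the channels populated -> full population gives 3/2 the bandwidth.
print(float(bandwidth_headroom(Fraction(2, 3))))  # 0.5, i.e. +50%

# 16 cores tested vs. an assumed 192-core part.
print(float(compute_headroom(16, 192)))  # 12.0
```

So the 50% figure follows directly from the channel count, and the compute figure is a core-count ratio of about 12 under that assumption, in line with "around 10 times".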

Cool, I didn’t find that one! Thanks.

A single 192-core Epyc is 11k by itself, so I’d probably go for the simpler integrated M3 Ultra solution…