by la_oveja 469 days ago
5 figures? It can be done for 6k: https://x.com/carrigmat/status/1884244369907278106

That's CPU-only memory, not high bandwidth, and not addressable by the GPU.
"Addressable" is a weird choice of words here.

CUDA has had managed memory for a long time now. You absolutely can address the entire host memory from your GPU. It will fetch it, if it's needed. Not fast, but addressable.

Windows has been doing this since what... the AGP era? Though this is a function of the ISA rather than the OS.
There isn't anything particularly high-bandwidth about Apple's LPDDR5 implementation, either. They just have a lot of channels, which is why I compared it to a 24-channel EPYC system. I agree that their integrated GPU architecture hits a unique design point that you don't get from NVIDIA, who prefer to ship smaller amounts of very different kinds of memory. Apple's architecture may be more suited to some workloads, but it hasn't exactly grabbed the machine learning market.
M3 Ultra has 819 GB/s, and a single EPYC CPU with 12 channels has 460 GB/s. As far as I know, llama.cpp and friends don’t scale across multiple sockets, so you can’t use a dual-socket Turin system to match the M3 Ultra.

Also, 32 GB DDR5 RDIMMs are ~200 each, so that’s about 5K for 24 right there. You also need two CPUs at ~1K each for the cheapest, plus a motherboard for another 1K. So for 8K (more, given you need a case, power supply, and cooling!), you get a system with about half the memory bandwidth, much higher power consumption, and a much larger footprint.
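
For anyone who wants to check the arithmetic, here is a rough tally of the parts listed above. The quantities and prices are just the approximate figures quoted in this comment, not real quotes, and the dictionary layout is only an illustration.

```python
# Rough parts tally using the approximate prices quoted in the comment above.
# Each entry maps a part name to (quantity, approximate unit price).
parts = {
    "32 GB DDR5 RDIMM": (24, 200),
    "CPU (cheapest tier)": (2, 1000),
    "dual-socket motherboard": (1, 1000),
}

def total_cost(parts):
    """Sum quantity * unit price over all listed parts."""
    return sum(qty * price for qty, price in parts.values())

memory_gb = 24 * 32          # total RAM capacity in GB
total = total_cost(parts)    # excludes case, power supply, and cooling

print(f"memory: {memory_gb} GB, parts total: ~{total}")
```

The total lands just under 8K before the case, power supply, and cooling, which matches the estimate above.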

Partial correction: an EPYC CPU with 12 channels has 576 GB/s, i.e. DDR5-6000 x 768 bits. That is 70% of the Apple memory bandwidth, but with possibly much more memory (768 GB in your example).

You do not need 2 CPUs. If you do use 2 CPUs, however, the memory bandwidth doubles to 1152 GB/s, exceeding Apple's by 40%. The memory would cost about the same if you used 16 GB modules, but the motherboard would be more expensive and the second CPU would add to the price.
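
The figures in this sub-thread all come from the same theoretical-peak formula: transfer rate times 8 bytes per 64-bit channel times the number of channels. A minimal sketch, assuming 12 channels per socket and the 819 GB/s M3 Ultra figure quoted above (the function name is made up for illustration, and these are peak numbers, not measured throughput):

```python
def peak_bandwidth_gbs(mt_per_s, channels, sockets=1, bytes_per_transfer=8):
    """Theoretical peak memory bandwidth in GB/s.

    mt_per_s: transfer rate in MT/s (e.g. 6000 for DDR5-6000)
    bytes_per_transfer: 8 bytes for a 64-bit DDR5 channel
    """
    return mt_per_s * bytes_per_transfer * channels * sockets / 1000

M3_ULTRA_GBS = 819  # figure quoted earlier in the thread

ddr5_4800 = peak_bandwidth_gbs(4800, 12)             # 460.8
ddr5_6000 = peak_bandwidth_gbs(6000, 12)             # 576.0
dual_6000 = peak_bandwidth_gbs(6000, 12, sockets=2)  # 1152.0

print(f"single socket vs M3 Ultra: {ddr5_6000 / M3_ULTRA_GBS:.2f}x")
print(f"dual socket vs M3 Ultra:   {dual_6000 / M3_ULTRA_GBS:.2f}x")
```

This reproduces both the 460 GB/s number (DDR5-4800) and the 576 GB/s number (DDR5-6000), and the roughly 70% and 140% ratios against the M3 Ultra.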

Ah, I didn’t realize they’d upped the memory bandwidth to DDR5-6000 (vs 4800), thanks for the correction!

The memory bandwidth does not double, I believe. See this random issue for a graph with single- and dual-socket measurements; there is essentially no difference: https://github.com/abetlen/llama-cpp-python/issues/1098

Perhaps this is incorrect now, but I also know with 2x 4090s you don’t get higher tokens per second than 1x 4090 with llama.cpp, just more memory capacity.

(All of this only applies to llama.cpp; I have no experience with other software and how memory bandwidth may scale across sockets.)

The memory bandwidth does double, but to exploit it the program must be written and run with careful, NUMA-aware memory placement, so that cores mostly access memory attached to their own socket's memory controllers rather than memory attached to the other socket.

With a badly organized program, the performance can be limited not by the memory bandwidth, which is always exactly double for a dual-socket system, but by the transfers on the inter-socket links.
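
To make the NUMA point concrete, here is a toy model, not a measurement. It assumes two symmetric sockets, each streaming at the same rate, with some fraction of each socket's accesses going to the other socket's memory. The 200 GB/s link figure is a placeholder chosen for illustration, not the spec of any real inter-socket link, and the function name is hypothetical.

```python
def effective_bandwidth_gbs(local_gbs, link_gbs, remote_fraction, sockets=2):
    """Toy estimate of aggregate bandwidth on a symmetric multi-socket system.

    Each socket streams at rate T, with remote_fraction of T served by the
    other socket. Memory controllers cap T at local_gbs; the inter-socket
    link (per direction) caps remote_fraction * T at link_gbs.
    """
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    per_socket = local_gbs
    if remote_fraction > 0:
        per_socket = min(local_gbs, link_gbs / remote_fraction)
    return sockets * per_socket

LOCAL_GBS = 576   # per-socket peak from the DDR5-6000 x 12 channels figure
LINK_GBS = 200    # ASSUMPTION: placeholder link bandwidth, not a real spec

for f in (0.0, 0.5, 1.0):
    print(f"remote fraction {f:.1f}: "
          f"{effective_bandwidth_gbs(LOCAL_GBS, LINK_GBS, f):.0f} GB/s")
```

With all accesses local the model gives the full doubled figure; once half or more of the accesses cross sockets, the link rather than the memory controllers sets the limit.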

Moreover, your link is about older Intel Xeon Sapphire Rapids CPUs, with inferior memory interfaces and with more quirks in memory optimization.

CPUs do not have enough compute typically. You'll be compute bottlenecked before bandwidth if the model is large enough.

Time to first token, context length, and tokens/s are significantly inferior on CPUs when dealing with larger models even if the bandwidth is the same.
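
One way to see where compute starts to matter is a roofline-style estimate. This is a rough sketch for a dense model that ignores the KV cache, attention cost, and batching. The usual approximation is about 2 FLOPs per parameter per token, with the weights read once per forward pass. Every number below (model size, precision, CPU throughput) is a made-up placeholder, not a benchmark.

```python
def pass_time_and_bound(params, bytes_per_param, tokens_in_pass,
                        bandwidth_gbs, compute_tflops):
    """Lower-bound time (seconds) for one forward pass and which limit binds.

    memory time:  all weights streamed once per pass
    compute time: ~2 FLOPs per parameter for every token in the pass
    """
    memory_time = params * bytes_per_param / (bandwidth_gbs * 1e9)
    compute_time = 2 * params * tokens_in_pass / (compute_tflops * 1e12)
    bound = "memory" if memory_time >= compute_time else "compute"
    return max(memory_time, compute_time), bound

# ASSUMPTIONS: hypothetical 70B-parameter dense model at 1 byte/param,
# 576 GB/s of memory bandwidth, and a placeholder 5 TFLOPS of CPU compute.
PARAMS, BYTES_PER_PARAM = 70e9, 1
BANDWIDTH_GBS, COMPUTE_TFLOPS = 576, 5

decode_time, decode_bound = pass_time_and_bound(
    PARAMS, BYTES_PER_PARAM, 1, BANDWIDTH_GBS, COMPUTE_TFLOPS)
prefill_time, prefill_bound = pass_time_and_bound(
    PARAMS, BYTES_PER_PARAM, 1000, BANDWIDTH_GBS, COMPUTE_TFLOPS)

print(f"decode (1 token):       {decode_time:.3f} s, {decode_bound}-bound")
print(f"prefill (1000 tokens):  {prefill_time:.1f} s, {prefill_bound}-bound")
```

Under these placeholder numbers, generating one token at a time is limited by bandwidth, while processing a long prompt is limited by compute, which is where time to first token suffers.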

A single big server CPU can have a computational capability similar to a mid-range desktop NVIDIA GPU.

When used for ML/AI applications, a consumer GPU has much better performance per dollar.

Nevertheless, when it is desired to use much more memory than in a desktop GPU, a dual-socket server can have higher memory bandwidth than most desktop GPUs, i.e. more than an RTX 4090, and a computational capability that for FP32 could exceed an RTX 4080, but it would be slower for low-precision data where the NVIDIA tensor cores can be used.

Now compare the FLOPs
The bandwidth difference likely doesn't matter, though. Benchmarks of Apple Silicon show that compute becomes the bottleneck long before bandwidth runs out, even when fully loading all CPU cores, the GPU, etc.
Ah, it seems I was remembering the price of a higher-tier CPU, which can cost 6k on its own.

Thinking about it, you can get a decent 256 GB on consumer platforms now too, but the speed will be a bit crap and you would need to make sure the platform fully supports ECC UDIMMs.