
by aurareturn 470 days ago

It's both compute and bandwidth constrained - just like trying to run Crysis on CPU rendering.

The A770 has 16GB of VRAM. You're shuffling data to the GPU at a rate of 64GB/s, which is roughly an order of magnitude slower than the GPU's on-board VRAM. Hence, this setup is memory bandwidth constrained.
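To put a rough number on the bandwidth argument, here is a back-of-the-envelope sketch. It assumes decoding is purely bandwidth-bound, so every generated token has to stream its active weights across the link once. The 64GB/s figure is from the comment above; the 37B active parameters, 4-bit weights, and 560GB/s VRAM bandwidth are illustrative assumptions, not measurements of this setup.

```python
def max_tokens_per_second(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on decode speed if each token must read its weights once."""
    if gb_read_per_token <= 0:
        raise ValueError("gb_read_per_token must be positive")
    return bandwidth_gb_s / gb_read_per_token


# Assumed illustration values (not taken from the thread):
ACTIVE_PARAMS_BILLIONS = 37.0  # assumed parameters touched per token in a MoE model
BYTES_PER_PARAM = 0.5          # assumed 4-bit quantization
gb_per_token = ACTIVE_PARAMS_BILLIONS * BYTES_PER_PARAM

link_bound = max_tokens_per_second(64.0, gb_per_token)   # 64 GB/s, from the comment
vram_bound = max_tokens_per_second(560.0, gb_per_token)  # assumed VRAM bandwidth

print(f"over the 64 GB/s link: at most {link_bound:.1f} tokens/s")
print(f"from on-board VRAM:    at most {vram_bound:.1f} tokens/s")
```

This is only a ceiling: real throughput also depends on caching, which experts stay resident on the GPU, and compute.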

However, once you want to do anything useful with it, such as running a longer context, CPU compute becomes a huge bottleneck for time-to-first-token as well as tokens/s.
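A rough sketch of why longer contexts hurt time-to-first-token when the compute is slow. It uses the common approximation of about 2 FLOPs per active parameter per prompt token and ignores attention cost, which only makes long prompts worse. The parameter count and the sustained FLOP/s figure are hypothetical placeholders, not numbers from this setup.

```python
def prefill_seconds(prompt_tokens: int, active_params: float, flops_per_second: float) -> float:
    """Estimate prompt-processing time, ignoring attention and memory stalls."""
    if flops_per_second <= 0:
        raise ValueError("flops_per_second must be positive")
    # ~2 FLOPs (one multiply, one add) per active parameter per token.
    total_flops = 2.0 * active_params * prompt_tokens
    return total_flops / flops_per_second


# Hypothetical values for illustration only:
ACTIVE_PARAMS = 37e9        # assumed active parameters per token
CPU_FLOPS = 1e12            # assumed sustained CPU throughput (1 TFLOP/s)

for tokens in (512, 8192):
    seconds = prefill_seconds(tokens, ACTIVE_PARAMS, CPU_FLOPS)
    print(f"{tokens} prompt tokens: about {seconds:.0f} s before the first token")
```

The estimate scales linearly with prompt length, so a 16x longer prompt means a 16x longer wait under these assumptions.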

Trying to run a model this large, and a thinking one at that, on CPU RAM is a gimmick.


refulgentis 470 days ago

Okay, let's stipulate LLMs are compute and bandwidth sensitive (of course!)...

#1, should highlight it up front this time: We are talking about _G_PUs :)

#2 You can't get a single consumer GPU that has enough memory to load a 670B parameter model, so there's some magic going on here. It's notable and distinct. This is probably due to FlashMoE, given its prominence in the link.
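For scale, a quick sketch of the weight footprint versus a single card. The 670B parameter count and 16GB of VRAM are from this thread; the 4-bit quantization is an assumption, and the estimate ignores KV cache, activations, and quantization metadata.

```python
def weight_footprint_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def fits_in_memory(num_params: float, bits_per_param: float, memory_gb: float) -> bool:
    """True if the weights alone would fit in the given amount of memory."""
    return weight_footprint_gb(num_params, bits_per_param) <= memory_gb


NUM_PARAMS = 670e9   # from the comment above
BITS = 4             # assumed 4-bit quantization
VRAM_GB = 16         # A770 memory, from the thread

size = weight_footprint_gb(NUM_PARAMS, BITS)
print(f"weights: ~{size:.0f} GB, fits in {VRAM_GB} GB: {fits_in_memory(NUM_PARAMS, BITS, VRAM_GB)}")
```

Under these assumptions the weights are about 20x larger than one card's memory, so most of them have to live somewhere else.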

TL;DR: 1) these are Intel _G_PUs, and 2) it is a remarkable, distinct achievement to load a 670B parameter model on only one or two cards.

aurareturn 470 days ago

1) This system mostly uses normal DDR RAM, not GPU VRAM.

2) The M3 Ultra can load DeepSeek R1 671B at Q4.

Using a very large LLM across the CPU and GPU is not new. It's been done since the beginning of local LLMs.