by refulgentis 470 days ago
Okay, let's stipulate LLMs are compute and bandwidth sensitive (of course!)...

#1, should highlight it up front this time: We are talking about _G_PUs :)

#2 You can't get a single consumer GPU with enough memory to load a 670B parameter model, so there's some magic going on here. It's notable and distinct. This is probably due to FlashMoE, given its prominence in the link.

TL;DR: 1) these are Intel _G_PUs, and 2) it is a remarkable, distinct achievement to load a 670B parameter model on only one or two cards.

1 comment

1) This system mostly uses normal DDR RAM, not GPU VRAM.

2) An M3 Ultra can load DeepSeek R1 671B at Q4.

Using a very large LLM across the CPU and GPU is not new. It's been done since the beginning of local LLMs.
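
For a rough sense of the sizes involved, here is a back-of-the-envelope sketch. It counts weights only (no KV cache or activations) and treats Q4 as exactly 4 bits per weight, which real quantization formats slightly exceed. The 24 GB figure for a consumer card is an assumption for illustration; 512 GB is the largest unified-memory configuration of the M3 Ultra. The function names are made up for this example.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    # params_billion * 1e9 weights * (bits / 8) bytes each, divided by 1e9
    return params_billion * bits_per_weight / 8


def fits_in(capacity_gb: float, needed_gb: float) -> bool:
    """True if the weights fit in the given memory pool (ignores all overhead)."""
    return needed_gb <= capacity_gb


q4_gb = weight_footprint_gb(671, 4)     # 4-bit quantized weights
fp16_gb = weight_footprint_gb(671, 16)  # unquantized 16-bit weights

CONSUMER_GPU_VRAM_GB = 24  # assumed typical high-end consumer card
M3_ULTRA_MAX_GB = 512      # largest M3 Ultra unified-memory configuration

print(f"Q4 weights:   {q4_gb:.1f} GB")
print(f"FP16 weights: {fp16_gb:.1f} GB")
print("Fits on one consumer GPU:", fits_in(CONSUMER_GPU_VRAM_GB, q4_gb))
print("Fits in M3 Ultra memory: ", fits_in(M3_ULTRA_MAX_GB, q4_gb))
```

So even at Q4 the weights are an order of magnitude larger than one consumer card's VRAM, which is why the bulk has to sit in system RAM or unified memory.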