| HN Mirror



	by isoprophlex 308 days ago
	Extremely impressive, but can one really run these >200B param models on prem in any cost effective way? Even if you get your hands on cards with 80GB ram, you still need to tie them together in a low-latency high-BW manner. It seems to me that small/medium sized players would still need a third party to get inference going on these frontier-quality models, and we're not in a fully self-owned self-hosted place yet. I'd love to be proven wrong though.
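For a rough sense of scale, the memory needed just to hold the weights can be estimated from parameter count and precision. This is a minimal sketch, not a sizing tool: the function names are hypothetical, and it assumes weights dominate memory, ignoring the KV cache, activations, and runtime overhead, so real requirements are higher.

```python
import math


def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes).

    Assumption: every parameter is stored at the same precision.
    """
    return params_billion * bits_per_weight / 8


def min_devices(params_billion: float, bits_per_weight: float, device_gb: float) -> int:
    """Smallest number of devices whose combined memory holds the weights.

    Assumption: weights split evenly across devices; KV cache, activations
    and framework overhead are NOT counted.
    """
    return math.ceil(weight_memory_gb(params_billion, bits_per_weight) / device_gb)


if __name__ == "__main__":
    # 200B parameters on 80 GB cards, at 16-bit and at 4-bit precision.
    for bits in (16, 4):
        size = weight_memory_gb(200, bits)
        cards = min_devices(200, bits, 80)
        print(f"{bits}-bit: ~{size:.0f} GB of weights, at least {cards} x 80 GB cards")
```

Under these assumptions a 200B model needs several 80 GB cards at 16-bit precision and more than one even at 4-bit, which is why the interconnect between cards matters.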

2 comments

Borealid 308 days ago

A Framework Desktop exposes 96GB of RAM for inference and costs a few thou USD.


michaelanckaert 308 days ago

You need memory on the GPU, not in the system itself (unless you have unified memory, as on Apple's M-series chips). So we're talking about cards like the H200 that have 141GB of memory and cost between 25k and 40k.


Borealid 308 days ago

Did you casually glance at how the hardware in the Framework Desktop (Strix Halo) works before commenting?


michaelanckaert 308 days ago

I didn't glance at it, I read it :-) The architecture is a 'unified memory bus', so yes, the GPU has access to that memory.

My comment was a bit unfortunate as it implied I didn't agree with yours, sorry for that. I simply want to clarify that there's a difference between 'GPU memory' and 'system memory'.

The Framework Desktop is a nice deal. I wouldn't buy the Ryzen AI+ myself; from what I read, it maxes out at about 60 tokens/sec, which is low for my use cases.


ramon156 308 days ago

These don't run 200B models at all; results show they can run 13B at best. 70B is ~3 tok/s according to someone on Reddit.


Borealid 307 days ago

I don't know where you've got those numbers, but they're wrong.

https://www.reddit.com/r/LocalLLaMA/comments/1n79udw/inferen... seems comparable to the Framework Desktop and reputable - they didn't just quote a number, they showed benchmark output.

I get far more than 3 t/s for a 70B model on normal non-unified RAM, so that figure is implausibly low for a unified memory architecture like Strix Halo.


mhast 307 days ago

It depends on the model.

It's typically OK for MoE models, but if you try to run something non-MoE the speed will plummet. In that same thread there are people getting 50 tok/s on MoE models and 5 tok/s on non-MoE. (https://www.reddit.com/r/LocalLLaMA/comments/1n79udw/comment...)

And while it has unified memory, the memory is quite slow: 250 GB/s compared to 500+ GB/s for the M4 Max or 1800 GB/s for a 5090. So it's fast for a CPU, but pretty slow for a GPU.

(That said, there are not a lot of cheap options for running large models locally. They all have significant compromises.)
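The bandwidth point can be turned into a back-of-the-envelope ceiling. The sketch below assumes token generation is limited purely by memory bandwidth, with every active weight read once per generated token, so tokens/s cannot exceed bandwidth divided by the bytes of active weights. The function name, the 4-bit precision, and the 5B-active MoE size are illustrative assumptions; only the bandwidth figures come from the comment above. It ignores compute limits, the KV cache, and whether the model fits on the device at all, so real numbers will be lower.

```python
def max_decode_tokens_per_s(
    bandwidth_gb_s: float, active_params_billion: float, bits_per_weight: float
) -> float:
    """Upper bound on decode speed for a purely bandwidth-bound model.

    Assumption: each generated token requires reading all *active* weights
    once from memory. For a dense model that is every parameter; for an MoE
    model only the experts selected for that token count.
    """
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


if __name__ == "__main__":
    # Bandwidths quoted in the comment (GB/s); "500+" is taken as 500.
    bandwidths = {"Strix Halo": 250, "M4 Max": 500, "5090": 1800}
    # Illustrative 4-bit models: a dense 70B, and a hypothetical MoE
    # that activates 5B parameters per token.
    models = {"dense 70B": 70, "MoE, 5B active (hypothetical)": 5}
    for device, bw in bandwidths.items():
        for name, active in models.items():
            rate = max_decode_tokens_per_s(bw, active, 4)
            print(f"{device:>10} | {name:<30} | <= {rate:6.1f} tok/s")
```

Because the bound scales with active rather than total parameters, an MoE model with few active parameters can decode far faster than a dense model on the same hardware, which matches the gap between MoE and non-MoE speeds described above.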


buyucu 308 days ago

I'm running them on GMKTec Evo 2.
