| HN Mirror



	by mholm 149 days ago
	The reason Macs get recommended is the unified memory, which the GPU can use as VRAM. People are using the AMD Strix Halo for AI for the same reason, since it has a similar memory architecture. Time to first token for something like '1+1=' would be seconds, and then you'd be getting ~20 tokens per second, which is plenty fast for regular use. Tokens/s drops at the higher end of context, but it's still practical for a lot of use cases. Though I agree that agentic coding, especially over large projects, would likely get too slow to be practical.
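To make the unified-memory point concrete, here is a rough back-of-the-envelope sketch of whether a model's weights fit in a given memory pool. The 20% overhead for KV cache and runtime buffers, the 4-bit quantization, and the 512 GB and 1024 GB memory sizes are all illustrative assumptions, not figures from this thread.

```python
def weight_footprint_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (10^9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


def fits_in_memory(
    total_params: float,
    bits_per_weight: float,
    memory_gb: float,
    overhead_fraction: float = 0.2,  # assumed headroom for KV cache and buffers
) -> bool:
    """Rough check: do the weights plus assumed overhead fit in memory_gb?"""
    needed_gb = weight_footprint_gb(total_params, bits_per_weight) * (1 + overhead_fraction)
    return needed_gb <= memory_gb


if __name__ == "__main__":
    # Hypothetical configurations: (label, parameter count, bits per weight, memory in GB)
    cases = [
        ("70B @ 4-bit, 64 GB machine", 70e9, 4, 64),
        ("1T @ 4-bit, one 512 GB machine", 1e12, 4, 512),
        ("1T @ 4-bit, two 512 GB machines", 1e12, 4, 1024),
    ]
    for label, params, bits, mem in cases:
        size = weight_footprint_gb(params, bits)
        print(f"{label}: weights ~{size:.0f} GB, fits: {fits_in_memory(params, bits, mem)}")
```

Under these assumptions a 1T-parameter model at 4 bits needs roughly 500 GB for weights alone, which is why a single machine is tight and a pair is the more plausible setup.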

2 comments

PlatoIsADisease 149 days ago

We are getting into a debate between particulars and universals. To call the 'unified memory' VRAM is quite a generalization. Whatever the case, we can tell from stock prices that whatever this VRAM is, it's nothing compared to NVIDIA.

Anyway, we were trying to run a 70B model on a MacBook (can't remember which M-series model) at a Fortune 20 company, and it never became practical. We were trying to compare strings of ~200 characters each, so about 400 characters plus a pre-prompt.

I can't imagine this being reasonable on a 1T model, let alone the 400B models of DeepSeek and Llama.


Gracana 149 days ago

With 32B active parameters, Kimi K2.5 will run faster than your 70B model.
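A rough way to see why: token generation is usually limited by memory bandwidth, since each new token requires reading the active weights once. The sketch below computes that upper bound. The 800 GB/s bandwidth and 4-bit quantization are illustrative assumptions, and the estimate ignores KV cache reads, prompt processing, and expert routing overhead, so real numbers will be lower.

```python
def decode_tokens_per_second_upper_bound(
    active_params: float,
    bits_per_weight: float,
    bandwidth_gb_s: float,
) -> float:
    """Bandwidth-bound ceiling on decode speed.

    Assumes every generated token reads all active weights exactly once
    and that nothing else competes for memory bandwidth.
    """
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


if __name__ == "__main__":
    BANDWIDTH_GB_S = 800  # assumed memory bandwidth, for illustration only
    BITS = 4              # assumed quantization

    dense_70b = decode_tokens_per_second_upper_bound(70e9, BITS, BANDWIDTH_GB_S)
    moe_32b_active = decode_tokens_per_second_upper_bound(32e9, BITS, BANDWIDTH_GB_S)

    print(f"Dense 70B ceiling:      {dense_70b:.1f} tok/s")
    print(f"MoE 32B-active ceiling: {moe_32b_active:.1f} tok/s")
    print(f"Ratio:                  {moe_32b_active / dense_70b:.2f}x")
```

The ratio is just 70/32, independent of the assumed bandwidth. The full 1T parameters still have to fit in memory, but only the active 32B are read per token.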


simonw 149 days ago

Here's a video of a previous 1T K2 model running using MLX on a pair of Mac Studios: https://twitter.com/awnihannun/status/1943723599971443134 - performance isn't terrible.


PlatoIsADisease 149 days ago

Is there a catch? I was not getting anything like this on a 70B model.

EDIT: oh, it's a marketing account and the program never finished... who knows the validity.


simonw 149 days ago

I don't think Awni should be dismissed as a "marketing account" - they're an engineer at Apple who's been driving the MLX project for a couple of years now, and they've earned a lot of respect from me.


PlatoIsADisease 149 days ago

Given how secretive Apple is, oh my, it's a super-duper marketing account.


mholm 149 days ago

Jeff Geerling and a few others also got access to similarly specced Mac clusters. They replicated this performance.

The tooling involved has improved significantly over the past year.


zozbot234 149 days ago

Not too slow if you just let it run overnight or in the background. But the biggest draw would be no rate limits whatsoever compared to the big proprietary APIs, especially Claude's. No risk of sudden rug pulls either, and the model will have very consistent performance.
