HN Mirror



	by EruditeCoder108 86 days ago
	This is less about “running a 400B model on a phone” and more about clever engineering around constraints. What’s actually happening is:

	- In mixture-of-experts, only a small subset of weights is active per token
	- Aggressive quantization
	- Streaming weights from storage instead of loading everything into RAM

	So the effective working set is much smaller than 400B. That said, the trade-offs are obvious: very low token throughput, high latency, and heavy reliance on storage bandwidth. It’s more of a proof-of-concept than something usable.
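
A rough back-of-the-envelope sketch of the working-set argument above, in Python. Only the 400B total comes from the comment; the active fraction (5%), the bit widths, and the storage bandwidth (2 GiB/s) are illustrative assumptions, not figures from the write-up. The throughput number is a pessimistic bound that assumes every active weight is streamed from storage for every token, with nothing cached in RAM.

```python
def working_set_gib(total_params, active_fraction, bits_per_weight):
    """Approximate size in GiB of the weights touched per token."""
    active_params = total_params * active_fraction
    return active_params * bits_per_weight / 8 / 2**30


def streamed_tokens_per_second(working_set, storage_gib_per_s):
    """Throughput if the whole per-token working set is read from storage."""
    return storage_gib_per_s / working_set


TOTAL_PARAMS = 400e9          # from the comment: a 400B-parameter model
ACTIVE_FRACTION = 0.05        # assumption: share of weights active per token
STORAGE_GIB_PER_S = 2.0       # assumption: sustained flash read bandwidth

# Dense model in 16-bit weights versus sparse activation with 4-bit weights.
full_fp16 = working_set_gib(TOTAL_PARAMS, 1.0, 16)
active_4bit = working_set_gib(TOTAL_PARAMS, ACTIVE_FRACTION, 4)
rate = streamed_tokens_per_second(active_4bit, STORAGE_GIB_PER_S)

print(f"dense fp16 weights:      {full_fp16:8.1f} GiB")
print(f"active 4-bit per token:  {active_4bit:8.1f} GiB")
print(f"streaming-bound rate:    {rate:8.2f} tokens/s")
```

Under these assumed numbers the per-token working set shrinks by a factor of 80, yet it is still several GiB that has to come off storage, which is why throughput ends up bound by storage bandwidth rather than compute.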

2 comments

I’ve seen this story making the rounds and I’m not sure why it’s gotten so much traction. Is it just a good write-up?

Thanks, bot.

Wouldn't a bot write better English? Or are they optimized to produce bad grammar already?

This isn't bad grammar, it's bad formatting because it was copy-pasted from somewhere and the newlines didn't take.