HN Mirror



	by Barathkanna 186 days ago
	That won’t realistically work for this model. Even with only ~32B active params, a 1T-scale MoE still needs the full expert set available for fast routing, which means hundreds of GB to TBs of weights resident. Mac Studios don’t share unified memory across machines, Thunderbolt isn’t remotely comparable to NVLink for expert exchange, and bandwidth becomes the bottleneck immediately. You could maybe load fragments experimentally, but inference would be impractically slow and brittle. It’s a very different class of workload than private coding models.
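	As a rough check on the "hundreds of GB to TBs" figure, here is a back-of-the-envelope sketch. It assumes exactly 1e12 total parameters and counts only weight storage (no KV cache, activations, or runtime overhead); the quantization widths are illustrative choices, not a statement about how any particular model is shipped.

```python
def weight_footprint_gb(total_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return total_params * bits_per_param / 8 / 1e9


# "1T-scale" from the comment above; treated as exactly 1e12 for illustration.
TOTAL_PARAMS = 1e12

# Illustrative quantization widths (assumptions, not published figures).
for bits in (4, 8, 16):
    gb = weight_footprint_gb(TOTAL_PARAMS, bits)
    print(f"{bits}-bit weights: {gb:,.0f} GB")
```

	Under these assumptions the weights alone span roughly 500 GB (4-bit) to 2 TB (16-bit), which is where the "hundreds of GB to TBs" range comes from.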

4 comments

bertili 186 days ago

People are running the previous Kimi K2 on 2 Mac Studios at 21 tokens/s or 4 Macs at 30 tokens/s. It's still premature, but not a completely crazy proposition for the near future, given the rate of progress.


NitpickLawyer 186 days ago

> 2 Mac Studios at 21 tokens/s or 4 Macs at 30 tokens/s

Keep in mind that most people posting speed benchmarks run them with basically zero context. Those speeds will not hold at 32k/64k/128k context lengths.


zozbot234 186 days ago

If "fast" routing is per-token, the experts can just reside on SSDs; the performance is good enough these days. You don't need to globally share unified memory across the nodes; you'd just run distributed inference.

Anyway, in the future your local model setups will just be downloading experts on the fly from experts-exchange. That site will become as important to AI as downloadmoreram.com.
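For a sense of scale on the SSD idea above, here is a worst-case sketch. It assumes every one of the ~32B active parameters has to be read from SSD for each token, with nothing cached in RAM. The 4-bit width and the 6 GB/s read bandwidth are hypothetical inputs, not measurements; a real setup would keep shared layers and frequently used experts resident and would do better than this bound.

```python
def worst_case_tokens_per_second(
    active_params: float,
    bits_per_param: float,
    ssd_gb_per_s: float,
) -> float:
    """Upper bound on tokens/s if all active weights stream from SSD per token.

    Ignores caching, compute time, and any weights kept resident in RAM.
    """
    bytes_per_token = active_params * bits_per_param / 8
    return ssd_gb_per_s * 1e9 / bytes_per_token


# ~32B active params comes from the thread; the rest are hypothetical inputs.
ACTIVE_PARAMS = 32e9
BITS_PER_PARAM = 4      # assumed quantization
SSD_GB_PER_S = 6.0      # assumed sequential read bandwidth

rate = worst_case_tokens_per_second(ACTIVE_PARAMS, BITS_PER_PARAM, SSD_GB_PER_S)
print(f"worst case: {rate:.3f} tokens/s")
```

With these inputs each token needs 16 GB of reads, so the no-cache bound is well under one token per second; how close a real system gets to RAM-resident speeds depends on how much of the active set is already cached.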


YetAnotherNick 186 days ago

It depends on whether you are using tensor parallelism or pipeline parallelism; in the second case you don't need any sharing.
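A rough sketch of why the two strategies stress the interconnect so differently, counting only the activation payload moved per generated token. It assumes a dense Megatron-style tensor-parallel layout with two all-reduces per transformer layer, and a pipeline that hands one hidden-state vector across each stage boundary. The layer count, hidden size, and 2-byte activations are hypothetical, and MoE expert-parallel all-to-all traffic is deliberately left out.

```python
def pipeline_bytes_per_token(
    num_stages: int, hidden_size: int, bytes_per_value: int
) -> int:
    """One hidden-state vector crosses each of the (num_stages - 1) boundaries."""
    return (num_stages - 1) * hidden_size * bytes_per_value


def tensor_parallel_bytes_per_token(
    num_layers: int,
    hidden_size: int,
    bytes_per_value: int,
    allreduces_per_layer: int = 2,
) -> int:
    """Payload that must be all-reduced per token, before ring overheads."""
    return num_layers * allreduces_per_layer * hidden_size * bytes_per_value


# Hypothetical model shape, chosen only for illustration.
NUM_LAYERS = 64
HIDDEN_SIZE = 8192
BYTES_PER_VALUE = 2   # e.g. 16-bit activations
NUM_STAGES = 2        # e.g. two machines in a pipeline

pp = pipeline_bytes_per_token(NUM_STAGES, HIDDEN_SIZE, BYTES_PER_VALUE)
tp = tensor_parallel_bytes_per_token(NUM_LAYERS, HIDDEN_SIZE, BYTES_PER_VALUE)
print(f"pipeline: {pp} bytes/token, tensor parallel: {tp} bytes/token")
```

In this toy setup the tensor-parallel payload is two orders of magnitude larger and is also spread over many latency-sensitive synchronization points, which is why a slow link hurts it far more than it hurts a pipeline split.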


omneity 186 days ago

RDMA over Thunderbolt is a thing now.
