| HN Mirror



	by bigyabai 52 days ago
	A 5T MoE model is still bottlenecked by streaming weights from SSD, in addition to compute bottlenecks during prefill and decode.

1 comments

zozbot234 52 days ago

True, but a cluster built on pipeline parallelism can naturally stream from multiple SSDs in parallel. That probably makes offload somewhat more effective. And you also have RAM caching available as a natural possibility.
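
As a rough illustration of the parallel-streaming point, here is a back-of-envelope sketch of how read time scales when weights are split evenly across several SSDs. The 7 GB/s per-drive rate and 1 byte per parameter are assumptions chosen for illustration, not figures from the thread. Reading the full 5T parameters is a worst case, since an MoE model only touches some experts per token.

```python
def stream_seconds(weight_bytes: float, num_ssds: int, per_ssd_gb_s: float) -> float:
    """Seconds to read weight_bytes once, assuming an even split across SSDs
    and that every drive sustains per_ssd_gb_s (GB/s) concurrently."""
    aggregate_bytes_per_s = num_ssds * per_ssd_gb_s * 1e9
    return weight_bytes / aggregate_bytes_per_s


# Hypothetical inputs: 5e12 parameters at 1 byte each, 7 GB/s per SSD.
TOTAL_BYTES = 5e12 * 1.0
PER_SSD_GB_S = 7.0

for n in (1, 4, 16):
    secs = stream_seconds(TOTAL_BYTES, n, PER_SSD_GB_S)
    print(f"{n:>2} SSD(s): {secs:.1f} s per full pass over the weights")
```

Under these assumptions the read time falls linearly with the number of drives, which is the sense in which a pipeline-parallel cluster helps. It does not remove the bottleneck, only divides it.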

bigyabai 52 days ago

You won't be RAM caching much of anything with experts that are 220B parameters' worth of layers.
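
To put a number on that, here is a minimal sketch of the memory needed to hold 220B parameters at a few common precisions. The bytes-per-parameter values are assumptions (the thread does not say what quantization is used), and the sketch ignores any overhead beyond the raw weights.

```python
PARAMS = 220e9  # the "220b parameters" figure quoted in the comment above

# Assumed storage widths in bytes per parameter; not stated in the thread.
BYTES_PER_PARAM = {
    "16-bit": 2.0,
    "8-bit": 1.0,
    "4-bit": 0.5,
}


def footprint_gb(params: float, bytes_per_param: float) -> float:
    """Raw weight size in gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


for name, width in BYTES_PER_PARAM.items():
    print(f"{name}: {footprint_gb(PARAMS, width):.0f} GB")
```

Even at the assumed 4-bit width, that is 110 GB of raw weights, so a RAM cache on a typical node could hold only a fraction of it.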