Hacker News
by
nl
52 days ago
> You could run it on a cluster of nodes
Not sure this is an MBP either.
bigyabai
52 days ago
Not even a cluster of Mac Pros could run a dense 5T parameter model with RDMA, to my knowledge.
zozbot234
52 days ago
SOTA models are reportedly MoE, not dense.
bigyabai
52 days ago
A 5T MoE model is still bottlenecked by streaming weights from SSD, in addition to compute bottlenecks during prefill and decode.
zozbot234
52 days ago
True, but a cluster built on pipeline parallelism can naturally stream from multiple SSDs in parallel. That probably makes offload somewhat more effective. You also have RAM caching available as a natural possibility.
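For a rough sense of scale, here is a back-of-envelope sketch of the parallel-streaming argument. Every number in it is an illustrative assumption, not a measurement: 110 GB of weights read from SSD per decoded token, 5 GB/s of sustained read bandwidth per SSD, and one SSD per node. It also assumes the reads on different nodes overlap perfectly (for example via prefetching) and ignores compute, network, and cache hits, so the result is only an upper bound.

```python
def decode_tokens_per_second(streamed_bytes_per_token, ssd_bytes_per_second, num_nodes):
    """Upper bound on decode rate when SSD weight streaming is the only bottleneck.

    Assumes the streamed weights are split evenly across nodes and that each
    node reads its share from its own SSD concurrently with the others.
    """
    per_node_bytes = streamed_bytes_per_token / num_nodes
    seconds_per_token = per_node_bytes / ssd_bytes_per_second
    return 1.0 / seconds_per_token


# Hypothetical inputs, chosen only for illustration.
STREAMED_BYTES_PER_TOKEN = 110e9  # assumed weight bytes read per token
SSD_BYTES_PER_SECOND = 5e9        # assumed sustained read bandwidth per SSD

for nodes in (1, 4, 8):
    rate = decode_tokens_per_second(STREAMED_BYTES_PER_TOKEN, SSD_BYTES_PER_SECOND, nodes)
    print(f"{nodes} node(s): at most {rate:.3f} tokens/s")
```

Under these assumptions the bound scales linearly with node count, so more SSDs do help, but it starts from a very low single-node baseline.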
bigyabai
52 days ago
You won't be RAM caching much of anything with experts that are 220B parameters' worth of layers.
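A quick sketch of the arithmetic behind this, taking the 220B-parameters-per-expert figure above at face value. The quantization widths and the 128 GB of RAM per node are hypothetical values chosen for illustration, and the function ignores the KV cache, activations, and OS overhead, so it is optimistic.

```python
def experts_that_fit(ram_bytes, expert_params, bits_per_param):
    """Number of whole experts that fit in RAM at a given quantization width."""
    expert_bytes = expert_params * bits_per_param // 8
    return ram_bytes // expert_bytes


EXPERT_PARAMS = 220 * 10**9   # per-expert size claimed in the comment above
NODE_RAM_BYTES = 128 * 10**9  # hypothetical RAM per node

for bits in (4, 8, 16):
    size_gb = EXPERT_PARAMS * bits // 8 // 10**9
    count = experts_that_fit(NODE_RAM_BYTES, EXPERT_PARAMS, bits)
    print(f"{bits}-bit: {size_gb} GB per expert, {count} expert(s) fit in RAM")
```

On these assumed numbers a node holds at most one expert at 4 bits and none at 8 or 16 bits, which is the sense in which RAM caching would not cover much.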