|
|
|
|
|
by EruditeCoder108
86 days ago
|
|
This is less about “running a 400B model on a phone” and more about clever engineering around constraints.
What’s actually happening is:
in mixture-of-experts only a small subset of weights is active per token
Aggressive quantization
Streaming weights from storage instead of loading everything into RAM
So the effective working set is much smaller than 400B.
That said, the trade-offs are obvious: very low token throughput, high latency, and heavy reliance on storage bandwidth. It’s more of a proof-of-concept than something usable. |
|