by zozbot234
122 days ago
100B+ is the total parameter count, whereas what matters here is the active parameter count, which is very different for sparse MoE models. You're right that there's some overhead for the OS/software stack, but it's not that much. The KV cache is a good candidate for being swapped out, since it only receives a limited number of writes per emitted token.
|
Once you're swapping from disk, performance will be unusable for most people. And for local inference, the KV cache is the worst possible choice to put on disk.
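
To put rough numbers on both sides of this, here is a minimal back-of-the-envelope sketch. The model configuration (48 layers, 8 KV heads, head dimension 128, fp16 cache) and the 32K context are made-up illustrative values, not any specific model, and it assumes plain full attention where every cached entry is read for each new token:

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache added for one token (K and V, across all layers)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def kv_traffic(context_len, **cfg):
    """Return (bytes written, bytes read) in the KV cache for one emitted token.

    Assumes full attention: one new K/V entry is appended, and all
    `context_len` cached entries are read back to compute attention.
    """
    per_token = kv_bytes_per_token(**cfg)
    written = per_token
    read = per_token * context_len
    return written, read


# Hypothetical configuration, chosen only for illustration.
cfg = dict(n_layers=48, n_kv_heads=8, head_dim=128)
written, read = kv_traffic(context_len=32_768, **cfg)

print(f"written per token: {written / 2**10:.0f} KiB")
print(f"read per token:    {read / 2**30:.1f} GiB")
```

Under these assumptions the writes per token really are small (a few hundred KiB), but the reads scale with the whole context, which is where putting the KV cache on disk would hurt.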