|
|
|
|
|
by selfhoster11
335 days ago
|
|
It's challenging, but not impossible. With 2-bit quantisation, only about 250-ish gigabytes of RAM is required. It doesn't have to be VRAM either, and you can mix and match GPU+CPU inference. In addition, some people on /r/localLlama are having success with streaming the weights off SSD storage at 1 token/second, which is about the rate I get for DeepSeek R1. |
|