| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by selfhoster11 335 days ago
	It's challenging, but not impossible. With 2-bit quantisation, only about 250-ish gigabytes of RAM is required. It doesn't have to be VRAM either, and you can mix and match GPU+CPU inference. In addition, some people on /r/localLlama are having success with streaming the weights off SSD storage at 1 token/second, which is about the rate I get for DeepSeek R1.