| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by walrus01 3 hours ago
	Before people go and drop a gargantuan sum of money on a server capable of running it entirely in GPU, there's still a fair amount of used x86-64 servers capable of running it in CPU and RAM (using llama-server) for probably under $6000. For example a Dell R640 with two older Xeon 18-core CPUs and 1TB of RAM. Test it out at a slow token/sec rate and see if it fits your needs. Same idea for Kimi.

2 comments

sgc 3 hours ago

To check whether I understand how this all works: Wouldn't a 4 bit quant run reasonably well (for that hardware) with far less ram, something like 1.5x the 476gb, or 714gb+ ram?

link

walrus01 3 hours ago

Yes, but the price difference between buying a used x86-64 server with 512GB and 1024GB isn't that great, and if you're already determined to buy the hardware to run in CPU a "large" model (eg: Not Qwen 3.6 35B-A3B, gemma4 or similar size), the loss of quality and sometimes suspicious nature of the output from a 4-bit quant might be undesirable vs running a Q8 quant or full precision.

You would also want a lot of RAM for context/kv cache to make it usable so just the amount of RAM that will fit a Q4 model and run it (before any cache starts getting populated through active use) isn't enough.

link

qingcharles 3 hours ago

Agreed. There are some crazy good deals on these older servers. For me, the inference speed would be fine as I'd just get on with a million other tasks between each response.

link