

by azeirah 706 days ago

You don't need a 4090 at all. 16-bit requires about 24GB of VRAM; 8-bit quants (about 99% of the output quality) require only 12GB.

That's without the context window, so depending on how much context you want to use you'll need some more GB.
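For anyone who wants to check the arithmetic, here's a rough sketch. It assumes a model of about 12 billion parameters, which is only what the 24GB-at-16-bit figure implies, and it uses nominal bit widths. Real quant formats store some extra data for scales, so actual files come out a bit larger. The function name is made up for illustration.

```python
def weight_vram_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate VRAM for the weights alone, in GB (1 GB = 1e9 bytes).

    Ignores the context window (KV cache) and any runtime overhead.
    """
    total_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Assumption: ~12B parameters, inferred from "16-bit requires about 24GB".
N_PARAMS_BILLION = 12.0

for label, bits in [("16-bit", 16), ("8-bit", 8), ("6-bit", 6)]:
    print(f"{label}: ~{weight_vram_gb(N_PARAMS_BILLION, bits):.0f} GB")
```

Whatever context you want to use goes on top of these numbers.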

That is, assuming you'll be using llama.cpp, which is standard for consumer inference. Ollama is also built on llama.cpp, as is kobold.

At 8-bit, this thing will run fine on a 16GB card, and a q6 quantization will run fine on a 12GB card.

You'll still get good performance on an 8GB card with partial offloading, since you'll be running most of it on the GPU anyway.
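To get a feel for how much of the model stays on the GPU in that case, here's a back-of-the-envelope sketch. The 40-layer count and the 1.5GB reserve for context and overhead are hypothetical placeholders, not figures for any particular model, and it treats every layer as the same size. The result is the kind of number you'd pass to llama.cpp's `--n-gpu-layers` (`-ngl`) option.

```python
def gpu_layers_that_fit(model_gb: float, n_layers: int,
                        vram_gb: float, reserve_gb: float = 1.5) -> int:
    """Estimate how many layers fit in VRAM, assuming equal-sized layers.

    reserve_gb is VRAM held back for the context window and runtime
    overhead (a guessed default, not a measured value).
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))


# Hypothetical example: ~9GB of q6 weights split across 40 layers, 8GB card.
layers = gpu_layers_that_fit(model_gb=9.0, n_layers=40, vram_gb=8.0)
print(f"{layers} of 40 layers on the GPU")
```

Whatever doesn't fit runs on the CPU, which is slower, so the more layers you can keep on the card the better.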