by veselin
149 days ago
What is the state of using quants? For chat models, a few errors or some lost intelligence may matter only a little. But what happens to tool calling in coding agents? Does it fail catastrophically after a few steps in the agent? I am interested in whether I can run it on a 24GB RTX 4090. Also, would vLLM be a good option?
You should be able to run this in 22GB of VRAM, so your 4090 (or a 3090) would be safe. This model also uses MLA, so you can run pretty large context windows without eating up a ton of extra VRAM.
Edit: 19GB of VRAM for a Q4_K_M, and MLX4 is around 21GB, so you should be clear to run a lower-quant version on the 4090. Full BF16 is close to 60GB, so that is probably not viable.
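For anyone who wants to sanity-check those numbers, here is a rough back-of-the-envelope sketch in Python. It only uses the sizes quoted above; the 2GB of headroom for KV cache and runtime overhead is my own assumption, not a measured figure, and the function names are made up for illustration.

```python
# Approximate footprints quoted in the comment above, in GB.
FOOTPRINT_GB = {"Q4_K_M": 19, "MLX4": 21, "BF16": 60}


def implied_params_billions(bf16_gb: float) -> float:
    """BF16 stores 2 bytes per weight, so GB / 2 gives billions of weights."""
    return bf16_gb / 2


def bits_per_weight(size_gb: float, params_billions: float) -> float:
    """Average bits per weight implied by a file size and a parameter count."""
    return size_gb * 8 / params_billions


def fits(size_gb: float, vram_gb: float = 24, headroom_gb: float = 2) -> bool:
    """True if the weights plus an assumed headroom fit in the card's VRAM.

    headroom_gb is a guess covering KV cache and runtime overhead.
    """
    return size_gb + headroom_gb <= vram_gb


if __name__ == "__main__":
    params = implied_params_billions(FOOTPRINT_GB["BF16"])
    for name, size in FOOTPRINT_GB.items():
        bpw = bits_per_weight(size, params)
        print(f"{name}: {size}GB, ~{bpw:.1f} bits/weight, fits in 24GB: {fits(size)}")
```

By this estimate the two 4-bit variants fit on a 24GB card and BF16 does not, which matches the sizes above. Actual headroom depends on context length and the runtime, so treat the 2GB as a placeholder.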