Hacker News new | ask | show | jobs
by lerela 993 days ago
We have clarified the documentation, sorry about the confusion! 16GB should be enough but it requires some vLLM cache tweaking that we still need to work on, so we put 24GB to be safe. Other deployment methods and quantized versions can definitely fit on 16GB!
1 comments

Shouldn't it be much less than 16GB with vLLM's 4-bit AWQ? Probably consumer GPU-ish depending on the batch size?