by rft 10 days ago
Those numbers are wrong for most use cases. The LLM likely did not take G4's SWA (Sliding Window Attention) into account. With SWA disabled the numbers could be correct: I can't load a q8 without SWA on a 24GB card.
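To show why SWA changes the estimate so much, here is a rough back-of-the-envelope sketch of KV cache size. The layer count, head count, head size and window below are made-up placeholders, not G4's real configuration, and the formula ignores whatever extra buffers llama.cpp allocates. The idea is only that sliding-window layers cache at most `window` tokens, while full-attention layers cache the whole context.

```python
def kv_cache_bytes(n_ctx, n_layers, n_swa_layers, window,
                   n_kv_heads, head_dim, bytes_per_elem=2):
    """Rough KV cache size for a model mixing full and sliding-window layers."""
    # One K and one V vector per token per layer.
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_elem
    n_full_layers = n_layers - n_swa_layers
    # SWA layers never need to keep more than `window` tokens.
    swa_tokens = min(n_ctx, window)
    cached_tokens = n_full_layers * n_ctx + n_swa_layers * swa_tokens
    return per_token_per_layer * cached_tokens


# Hypothetical architecture, NOT the real G4 numbers.
CTX = 256 * 1024   # 256k context
LAYERS = 48        # assumed total layer count
SWA_LAYERS = 40    # assumed number of sliding-window layers
WINDOW = 1024      # assumed window size in tokens
KV_HEADS = 8       # assumed KV head count
HEAD_DIM = 128     # assumed head dimension

no_swa = kv_cache_bytes(CTX, LAYERS, 0, WINDOW, KV_HEADS, HEAD_DIM)
with_swa = kv_cache_bytes(CTX, LAYERS, SWA_LAYERS, WINDOW, KV_HEADS, HEAD_DIM)

GIB = 1024 ** 3
print(f"f16 KV cache without SWA: {no_swa / GIB:.2f} GiB")
print(f"f16 KV cache with SWA:    {with_swa / GIB:.2f} GiB")
```

With these placeholder values the unquantized cache comes out at 48 GiB when every layer is treated as full attention and a bit over 8 GiB once the SWA layers are capped at the window. The exact figures are meaningless, but a gap of that shape is what an estimate that ignores SWA would miss.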

I tested this with what was then the newest llama.cpp master on a Linux system with two 24GB 3090s, only one of which was used for testing. q8 without any KV quantization, 256k context, and mmproj loaded takes less than 20GB VRAM. This runs at about 1.5k to 2k tok/s pp (prompt processing) and 40-50 tok/s gen (with slightly lowered power limits and an undervolt). q8 with 64k unquantized context and mmproj takes just under 16GB VRAM. Drop down to the q6k model with no mmproj and 64k unquantized context, and it fits in 12GB VRAM. Go all the way down to q4km with some batch size tweaking, and it barely fits into 8GB VRAM.

64k context is the minimum for Hermes agent, so a vision-capable "agentic" model fits into a 16GB card. This is very impressive. I am currently testing how smart the model is, and it is doing decently so far: it had one looping issue that it recovered from after a lot of tokens, and it handled some basic tool calling.