by fluffyspork
10 days ago
According to Gemini, running the full 256k context window with unified RAM requires:
4-bit quantized (Q4_K_M GGUF): at least 25 GB of total RAM. This is the most practical configuration for consumer hardware.
8-bit quantized (Q8_0 / SFP8): at least 32 GB to 36 GB of RAM.
Uncompressed 16-bit (BF16): upwards of 45 GB to 50 GB of RAM, to account for both the 26.7 GB base model and the massive KV cache.
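For anyone wanting to sanity-check figures like these, here is a rough back-of-the-envelope sketch in Python. It takes the 26.7 GB BF16 figure at face value and scales it by bits per weight. The Q8_0 and Q6_K values follow from their block layouts, while the Q4_K_M value is only an approximate average and is an assumption. The `kv_cache_gb` helper covers plain full attention only, and its arguments (layer count, KV heads, head dimension) are placeholders since the thread does not give the model's architecture. Real GGUF files also keep some tensors at higher precision, so actual sizes will differ.

```python
# Approximate bits per weight for some llama.cpp quantization types.
# Q8_0 and Q6_K follow from their block layouts; Q4_K_M mixes tensor
# types, so its value here is a rough average (assumption).
BITS_PER_WEIGHT = {"BF16": 16.0, "Q8_0": 8.5, "Q6_K": 6.5625, "Q4_K_M": 4.85}


def weights_gb(bf16_gb: float, quant: str) -> float:
    """Scale a BF16 model size to another quantization by bits per weight."""
    return bf16_gb * BITS_PER_WEIGHT[quant] / BITS_PER_WEIGHT["BF16"]


def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                n_ctx: int, bytes_per_elem: float = 2.0) -> float:
    """KV cache size for plain full attention.

    Each layer stores one K and one V vector per token, hence the factor 2.
    Sliding-window or hybrid-attention models need less than this.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem / 1e9


if __name__ == "__main__":
    for quant in ("Q8_0", "Q6_K", "Q4_K_M"):
        print(f"{quant}: ~{weights_gb(26.7, quant):.1f} GB of weights")
```

Adding the two numbers together gives a lower bound only; compute buffers and any vision projector come on top.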
I tested this with what was then the newest llama.cpp master on a Linux system with two 24GB 3090s, only one of which was used for testing. q8 without any KV quant, 256k context and mmproj loaded takes less than 20GB VRAM. This runs at about 1.5k to 2k tok/s pp and 40-50 tok/s gen (with slightly lowered power limits and an undervolt). q8 with 64k non-quant context and mmproj takes just under 16GB VRAM. Drop down to the q6k model with no mmproj and 64k non-quant context, and it fits in 12GB VRAM. Go all the way down to q4km with some batch size tweaking, and it barely fits into 8GB VRAM.
64k context is the minimum for Hermes agent, so a vision-capable "agentic" model fits into a 16GB card. This is very impressive. I am currently testing how smart the model is, and it does decently so far: it had one looping issue that it recovered from after a lot of tokens, and it did some basic tool calling.
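The measurements above can be written down as a small lookup that returns the largest reported setup for a given card. This is only a sketch of the numbers in this comment, not a general sizing tool. The names `Config`, `REPORTED` and `pick` are made up for illustration, and the q4km row assumes 64k context and no mmproj, which the comment does not state explicitly.

```python
from typing import NamedTuple, Optional


class Config(NamedTuple):
    quant: str
    ctx_k: int     # context length in thousands of tokens
    mmproj: bool   # whether the vision projector was loaded
    vram_gb: int   # card size the setup was reported to fit into


# Ordered from most to least demanding, as reported in the comment
# (single 3090, unquantized KV cache).
REPORTED = [
    Config("q8", 256, True, 20),
    Config("q8", 64, True, 16),
    Config("q6k", 64, False, 12),
    Config("q4km", 64, False, 8),  # context and mmproj assumed, not stated
]


def pick(vram_gb: int) -> Optional[Config]:
    """Return the most demanding reported setup that fits the given VRAM."""
    for cfg in REPORTED:
        if cfg.vram_gb <= vram_gb:
            return cfg
    return None


if __name__ == "__main__":
    for card in (24, 16, 12, 8, 6):
        print(card, "GB ->", pick(card))
```

The thresholds are upper bounds ("less than 20GB", "just under 16GB"), so a card of exactly that size leaves very little headroom for anything else on the GPU.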