by dulakian 469 days ago
I am using the Q6_K_L quant, and it's running at about 40 GiB of VRAM including the KV cache.

Device 1 [NVIDIA GeForce RTX 4090] MEM[||||||||||||||||||20.170Gi/23.988Gi]

Device 2 [NVIDIA GeForce RTX 4090] MEM[||||||||||||||||||19.945Gi/23.988Gi]

1 comment

What's the context length?
The model supports a context of 131,072 tokens, but I only have 48 GiB of VRAM, so I run it with a context of 32,768.
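
For anyone wondering why the context length matters so much for VRAM, here is a rough sketch of the usual KV cache size estimate: two tensors (K and V) per layer, each holding one vector per KV head per token. The thread never names the model, so the layer count, KV head count, head dimension, and fp16 cache below are purely hypothetical placeholders, not the poster's actual setup. Swap in the values from your own model's config.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Estimate KV cache size in bytes for a transformer decoder.

    The leading 2 accounts for storing both K and V in every layer.
    bytes_per_elem=2 assumes an unquantized fp16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


# Hypothetical architecture, chosen only for illustration.
# These are NOT taken from the thread; read them from your model's config.
HYPOTHETICAL_MODEL = dict(n_layers=64, n_kv_heads=8, head_dim=128)

for ctx in (32768, 131072):
    size = kv_cache_bytes(context_len=ctx, **HYPOTHETICAL_MODEL)
    print(f"context {ctx:>6}: {size / GIB:.1f} GiB KV cache")
```

Under those assumed numbers the cache comes to 8 GiB at 32,768 tokens and 32 GiB at 131,072, since the estimate grows linearly with context. That is the general reason a full-length context can fail to fit next to the weights even when a shorter one runs comfortably.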