by dulakian 469 days ago

I'm using the Q6_K_L quant, and it takes about 40 GiB of VRAM across two GPUs, including the KV cache.

Device 1 [NVIDIA GeForce RTX 4090] MEM[||||||||||||||||||20.170Gi/23.988Gi]

Device 2 [NVIDIA GeForce RTX 4090] MEM[||||||||||||||||||19.945Gi/23.988Gi]

1 comment

What's the context length?

The model supports a context of 131,072 tokens, but I only have 48 GB of VRAM, so I run it with a context of 32,768 tokens.
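
For anyone wondering why the context length matters so much here: KV cache size grows linearly with context, so going from 32,768 to 131,072 tokens roughly quadruples it. A rough sketch of the usual estimate is below. The model isn't named in this thread, so the layer count, KV head count, and head dimension are made-up placeholders, and the 16-bit cache element size is also an assumption. Plug in the real values from your model's config.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Approximate KV cache size in bytes.

    Each layer stores one key and one value vector per token (hence the
    factor of 2), each of size n_kv_heads * head_dim elements.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


# Hypothetical model shape, NOT taken from the thread.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 64, 8, 128

for ctx in (32_768, 131_072):
    size = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, ctx)
    print(f"context {ctx:>7}: {size / GIB:.1f} GiB KV cache")
```

With these placeholder numbers the cache comes to 8.0 GiB at 32,768 tokens and 32.0 GiB at 131,072. On top of the model weights, the full context would not fit in 48 GB under those assumptions, which matches the kind of trade-off described above.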