by lloyd-christmas
6 days ago
Not the person you asked, but I have a 9700, which has the same VRAM. Running Q6 on it with unquantized KV gives me 50k context; adding -ctv q8_0 ups that to 70k. I normally run Q4 with unquantized KV at 130k context and 50 t/s (mtp 3), with the disclaimer that I'm on PCIe gen4 x8, so I'm slightly slowed. I've found that quantizing K leads to broken JSON on tool calls, which is fairly unrecoverable, but YMMV.
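For a rough sanity check on the 50k to 70k jump, here is a back-of-the-envelope sketch. It assumes "unquantized" means f16 (2 bytes per element), that q8_0 stores each block of 32 values as 32 int8s plus one f16 scale (34 bytes), and that the KV cache is the only thing that grows with context inside a fixed memory budget. The function name is made up for illustration, and real numbers will differ because of compute buffers and allocator rounding.

```python
# Approximate bytes per cached element for each cache type (assumptions above).
F16_BYTES = 2.0
Q8_0_BYTES = 34 / 32  # 32 int8 values + one f16 scale per block

def scaled_context(base_ctx, k_old, v_old, k_new, v_new):
    """Estimate the context that fits in the same KV budget after
    changing the per-element size of the K and/or V cache."""
    old_cost = k_old + v_old  # relative per-token cost before
    new_cost = k_new + v_new  # relative per-token cost after
    return base_ctx * old_cost / new_cost

# K stays f16, only V goes to q8_0 (the -ctv q8_0 case).
estimate = scaled_context(50_000, F16_BYTES, F16_BYTES, F16_BYTES, Q8_0_BYTES)
print(round(estimate))
```

That lands at roughly 65k, which is in the same ballpark as the 70k observed, so the gain looks consistent with only the V half of the cache shrinking.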