by rsolva
29 days ago
I have experimented with both q8 and q4 for the KV cache. I can't find any difference between q8 and fp16, but q4 degrades more as the context grows. q8 seems like a good compromise and gives us enough context for about 6-8 concurrent full-context sessions. We have not fully tested those limits yet, though, as sessions rarely reach the context limit.
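
For anyone who wants to estimate this trade-off for their own setup, here is a rough back-of-the-envelope sketch. Everything in it is an assumption rather than the configuration described above: the model shape (32 layers, 8 KV heads, head dimension 128, 32k context) and the 16 GiB cache budget are hypothetical, and the bits-per-element figures for q8 and q4 assume llama.cpp-style block formats (32 values plus one fp16 scale per block).

```python
# Approximate bits stored per cached element.
# ASSUMPTION: q8 and q4 use llama.cpp-style blocks of 32 values plus one
# fp16 scale, i.e. 8.5 and 4.5 bits per element respectively.
BITS_PER_ELEMENT = {"fp16": 16.0, "q8": 8.5, "q4": 4.5}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, cache_type):
    """Bytes of KV cache needed by one session at full context."""
    # Factor 2 covers the separate K and V tensors in every layer.
    elements = 2 * n_layers * n_kv_heads * head_dim * context_len
    return elements * BITS_PER_ELEMENT[cache_type] / 8


def max_sessions(budget_bytes, **model):
    """How many full-context sessions fit into the given cache budget."""
    return int(budget_bytes // kv_cache_bytes(**model))


# HYPOTHETICAL model shape and memory budget, for illustration only.
MODEL = dict(n_layers=32, n_kv_heads=8, head_dim=128, context_len=32768)
BUDGET = 16 * 2**30  # 16 GiB reserved for KV cache

for cache_type in ("fp16", "q8", "q4"):
    per_session = kv_cache_bytes(cache_type=cache_type, **MODEL)
    sessions = max_sessions(BUDGET, cache_type=cache_type, **MODEL)
    print(f"{cache_type}: {per_session / 2**30:.3f} GiB per session, "
          f"{sessions} full-context sessions")
```

With these made-up numbers, fp16 fits 4 full-context sessions, q8 fits 7, and q4 fits 14. The estimate ignores model weights, activations, and any allocator overhead, so treat it as an upper bound on what the cache budget alone allows.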