Hacker News new | ask | show | jobs
by rsolva 29 days ago
I have experimented with both q8 and q4 for KV cache. I can't find any difference between q8 and fp16, but q4 suffers more when the context grows. q8 seems like a good compromise and gives us enough ctx for about 6-8 concurrent, full context sessions. But we have not fully tested those limits yet, as the context windows rarely reach the limit.