| HN Mirror

Their paper TurboQuant (TQ) is not new per say. It's released last year, and heavily rehash of old ideas that were released a year prior (RabitQ). There is also [a bit of drama](https://openreview.net/forum?id=tO3ASKZlok) there that boils down to what it seems a bit of malpractice for google's researchers. TQ does few things: it claims better compression quality and speed, and better KV cache handling. Currently KV cache takes a load of resources beside that of the model itself. Many people applied different quantization strategy for it, but the quality degradation is a too apparent. Enter Attention Rotation. This seems to have genuinely helped KV cache compression as per [llama.cpp latest tests](https://github.com/ggml-org/llama.cpp/pull/21038). On the other hand, [ik_llama.cpp](https://www.reddit.com/r/LocalLLaMA/comments/1s7nq6b/technic...) did tests on the quality of turboquant-3 compared to IQ4 quantized models, and yhe quality degradation is much worse. So it's 2 things: KV compression -> good. Turboquant quantazation -> not good.