Hacker News new | ask | show | jobs
by devmor 78 days ago
Local models just don't seem that useful for me for these particular tasks yet - the most recent versions of Codex and Claude Opus are the first time I've found them to be particularly useful in a "real engineering" context that isn't just vibe coding.

Google's TurboQuant might help address this, but it also might just widen the gap even further.

I am far on the skeptic edge when it comes to the generative AI side of ML tools though, so do take my opinion with that weight.

1 comments

Turboquant is totally irrelevant compared to current quantization methods. It has been thoroughly test by people who build inferencing engines for local models. It's all talk no actual meat to it.
Do you have any reading on this? I find it hard to believe something announced a week ago has been “thoroughly tested”.
Their paper TurboQuant (TQ) is not new per say. It's released last year, and heavily rehash of old ideas that were released a year prior (RabitQ). There is also [a bit of drama](https://openreview.net/forum?id=tO3ASKZlok) there that boils down to what it seems a bit of malpractice for google's researchers. TQ does few things: it claims better compression quality and speed, and better KV cache handling. Currently KV cache takes a load of resources beside that of the model itself. Many people applied different quantization strategy for it, but the quality degradation is a too apparent. Enter Attention Rotation. This seems to have genuinely helped KV cache compression as per [llama.cpp latest tests](https://github.com/ggml-org/llama.cpp/pull/21038). On the other hand, [ik_llama.cpp](https://www.reddit.com/r/LocalLLaMA/comments/1s7nq6b/technic...) did tests on the quality of turboquant-3 compared to IQ4 quantized models, and yhe quality degradation is much worse. So it's 2 things: KV compression -> good. Turboquant quantazation -> not good.