| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by 3abiton 122 days ago
	This is the bet of many of the big AI companies, and why they're subsidizing majorly the calls. With the latest cracks by the US gov, it seems Anthropic is starting to reduce those subsidies given their edge in the game. I am starting to consider local models more seriously beside just testing, but nowadays the ram/gpu market is bloated.

1 comments

devmor 122 days ago

Local models just don't seem that useful for me for these particular tasks yet - the most recent versions of Codex and Claude Opus are the first time I've found them to be particularly useful in a "real engineering" context that isn't just vibe coding.

Google's TurboQuant might help address this, but it also might just widen the gap even further.

I am far on the skeptic edge when it comes to the generative AI side of ML tools though, so do take my opinion with that weight.

link

3abiton 122 days ago

Turboquant is totally irrelevant compared to current quantization methods. It has been thoroughly test by people who build inferencing engines for local models. It's all talk no actual meat to it.

link

devmor 121 days ago

Do you have any reading on this? I find it hard to believe something announced a week ago has been “thoroughly tested”.

link

3abiton 121 days ago

Their paper TurboQuant (TQ) is not new per say. It's released last year, and heavily rehash of old ideas that were released a year prior (RabitQ). There is also [a bit of drama](https://openreview.net/forum?id=tO3ASKZlok) there that boils down to what it seems a bit of malpractice for google's researchers. TQ does few things: it claims better compression quality and speed, and better KV cache handling. Currently KV cache takes a load of resources beside that of the model itself. Many people applied different quantization strategy for it, but the quality degradation is a too apparent. Enter Attention Rotation. This seems to have genuinely helped KV cache compression as per [llama.cpp latest tests](https://github.com/ggml-org/llama.cpp/pull/21038). On the other hand, [ik_llama.cpp](https://www.reddit.com/r/LocalLLaMA/comments/1s7nq6b/technic...) did tests on the quality of turboquant-3 compared to IQ4 quantized models, and yhe quality degradation is much worse. So it's 2 things: KV compression -> good. Turboquant quantazation -> not good.

link