HN Mirror


by sroussey 58 days ago

Super interesting!

> - People have successfully used TurboQuant to quantize model weights (TQ3_4S), not just the context KV, to achieve smaller sizes than Q4 (~3.5 bpw) with much better PPL and faster decoding.

Where can I find more info on this? I’d like to convert models to ONNX this way.

> - Importance-weighted quantization (e.g. IQ4) also provides way better PPL, KLD, etc. at the same size as a Q4.

Where can I find more info on this? I’d like to convert models to ONNX this way.

The most difficult environment for small models is the browser. It would be great to push the SOTA in that environment.

2 comments

simjnd 57 days ago

For TurboQuant on model weights, AFAIK it's currently a single-person effort [1]. It needs his fork of llama.cpp, which hasn't been upstreamed. He publishes his quantizations on Hugging Face, but I'm not sure whether he has open-sourced the quantization pipeline.

[1]: https://x.com/coffeecup2020


hadlock 58 days ago

Google only released their TurboQuant paper barely a month ago. It is bleeding edge even by LLM standards.


sroussey 58 days ago

Actually, they published it a year ago. What's recent is the post on the official Google blog.

https://arxiv.org/abs/2504.19874

https://research.google/blog/turboquant-redefining-ai-effici...
