by modeless 781 days ago
I don't need exact results. FP8 quantization is almost lossless and even 6-bit quantization is usually acceptable. Can this be combined with quantization?
2 comments

Yes. It's speculative decoding, but instead of having the draft model generate just a few sequential tokens, they generate a whole tree of candidate continuations, hundreds of possible sequences, laid out in some sort of optimal shape.

It ends up being somewhat faster than regular speculative decoding in the normal setting (GPU only). If you are doing CPU offloading, it's massively faster.
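
To make the tree idea concrete, here's a rough toy sketch of greedy tree-based speculative decoding. This is not Sequoia's actual algorithm (as far as I understand, they optimize the tree shape and use sampling-based verification); the uniform tree and the names `target_next`, `draft_topk`, `build_tree` and `verify` are all made up for illustration, and both "models" are simple arithmetic stand-ins.

```python
from dataclasses import dataclass, field

VOCAB = 10  # toy vocabulary size (assumption for this sketch)

def target_next(prefix):
    """Toy stand-in for the large model: its greedy next token."""
    step = 3 if len(prefix) % 3 == 0 else 1
    return (prefix[-1] + step) % VOCAB

def draft_topk(prefix, k):
    """Toy stand-in for the small draft model: k guesses for the next token."""
    return [(prefix[-1] + i) % VOCAB for i in range(1, k + 1)]

@dataclass
class Node:
    token: int
    children: list = field(default_factory=list)

def build_tree(prefix, depth, k):
    """Draft a uniform tree: k children per node, `depth` levels deep."""
    def expand(path, d):
        if d == 0:
            return []
        return [Node(t, expand(path + [t], d - 1)) for t in draft_topk(path, k)]
    return expand(list(prefix), depth)

def verify(prefix, roots):
    """Follow the target's greedy choice down the tree.

    Every token kept here is the target's own choice, so the output matches
    plain greedy decoding. A real system would score all tree nodes in one
    batched forward pass; this toy calls the target once per level instead.
    """
    accepted, level = [], roots
    while True:
        want = target_next(list(prefix) + accepted)
        accepted.append(want)
        match = next((n for n in level if n.token == want), None)
        if match is None:  # draft tree did not contain the target's token
            return accepted
        level = match.children

def speculative_generate(prompt, n_tokens, depth=3, k=2):
    out, rounds = list(prompt), 0
    while len(out) - len(prompt) < n_tokens:
        out += verify(out, build_tree(out, depth, k))
        rounds += 1
    return out[: len(prompt) + n_tokens], rounds

def greedy_generate(prompt, n_tokens):
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target_next(out))
    return out

tokens, rounds = speculative_generate([0], 12)
print(tokens == greedy_generate([0], 12), rounds)
```

The point of the toy: the output is identical to plain greedy decoding from the big model, but it takes fewer verification rounds than tokens generated, because each round can accept a whole path through the drafted tree. A wider tree gives the target more chances to find its token among the drafts.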

Edit typo

> Can this be combined with quantization?

It's listed in the TODO section of https://github.com/Infini-AI-Lab/Sequoia/tree/main

INT8, not FP8
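
For anyone unsure what INT8 quantization means in practice, here's a minimal sketch of symmetric per-tensor INT8 weight quantization in plain Python. It's a generic illustration, not anything taken from the Sequoia repo; `quantize_int8` and `dequantize_int8` are hypothetical helper names.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    # Fall back to a scale of 1.0 when every weight is zero.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.01, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
# Rounding error per weight is at most half a quantization step.
print(max(abs(a - b) for a, b in zip(weights, restored)) <= scale / 2 + 1e-12)
```

Unlike FP8, the representable values here are evenly spaced, so the error depends on the largest weight in the tensor rather than on each weight's own magnitude.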