| HN Mirror



	by modeless 781 days ago
	I don't need exact results. FP8 quantization is almost lossless and even 6-bit quantization is usually acceptable. Can this be combined with quantization?
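To make the bit-width trade-off concrete, here is a minimal sketch of a quantize-and-restore round trip. It assumes plain symmetric uniform integer quantization, which is not the FP8 floating-point format, and the `weights` list is made-up illustrative data. Per-weight rounding error is only a rough proxy for the effect on model quality.

```python
def quantize_roundtrip(values, bits):
    """Symmetric uniform quantization to signed `bits`-bit integers and back.

    Assumes at least one value is nonzero (otherwise the scale would be 0).
    """
    qmax = 2 ** (bits - 1) - 1                  # 127 for 8 bits, 31 for 6 bits
    scale = max(abs(v) for v in values) / qmax  # one scale for the whole list
    codes = [round(v / scale) for v in values]  # integer codes in [-qmax, qmax]
    return [c * scale for c in codes]           # dequantized approximations


def max_abs_error(values, bits):
    """Largest absolute difference between original and restored values."""
    restored = quantize_roundtrip(values, bits)
    return max(abs(a - b) for a, b in zip(values, restored))


# Hypothetical weights, chosen only for illustration.
weights = [0.82, -0.41, 0.05, -1.0, 0.33, 0.67]
err8 = max_abs_error(weights, 8)
err6 = max_abs_error(weights, 6)
print(f"8-bit max error: {err8:.5f}")
print(f"6-bit max error: {err6:.5f}")
```

The rounding error is bounded by half a quantization step, so dropping from 8 to 6 bits makes the worst case roughly four times larger under this scheme.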

2 comments

mmoskal 781 days ago

Yes. It's speculative decoding, but instead of generating just a few sequential tokens with the draft model, they generate a whole tree of candidates, in some sort of optimal shape, covering hundreds of possible sequences.

It ends up being somewhat faster than regular speculative decoding in the normal setting (GPU only). If you are doing CPU offloading, it's massively faster.
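A toy sketch of the tree idea, assuming greedy verification only. This is not Sequoia's actual algorithm: `Node`, `target_next`, and `verify_tree` are hypothetical names, the "target model" is a hard-coded sentence, and a real implementation would score every tree node in a single batched forward pass rather than calling the model step by step.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One drafted token plus the alternative continuations drafted after it."""
    token: str
    children: list = field(default_factory=list)


def target_next(prefix):
    """Hypothetical stand-in for the large model's greedy next token."""
    sentence = ["the", "cat", "sat", "on", "the", "mat"]
    return sentence[len(prefix)] if len(prefix) < len(sentence) else "<eos>"


def verify_tree(prefix, roots):
    """Follow the draft branch that agrees with the target model.

    Returns the tokens produced this step: every matched draft token,
    plus one token from the target itself where the draft ran out or diverged.
    """
    accepted = []
    candidates = roots
    while True:
        want = target_next(prefix + accepted)
        accepted.append(want)  # the target's own choice is always kept
        match = next((n for n in candidates if n.token == want), None)
        if match is None:
            return accepted    # no drafted branch continues this way
        candidates = match.children


# A linear draft (regular speculative decoding) guesses "the cat ran".
chain = [Node("the", [Node("cat", [Node("ran")])])]

# A tree draft hedges with sibling alternatives at each depth.
tree = [
    Node("the", [
        Node("cat", [Node("ran"), Node("sat", [Node("in")])]),
        Node("dog"),
    ]),
    Node("a"),
]

chain_out = verify_tree([], chain)
tree_out = verify_tree([], tree)
print(chain_out)
print(tree_out)
```

In this toy case the tree yields one more token per verification step than the chain, because a sibling branch covered the token the single draft sequence missed.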

Edit: typo

dimask 781 days ago

> Can this be combined with quantization?

It is listed in the TODO section of their repo: https://github.com/Infini-AI-Lab/Sequoia/tree/main

alecco 780 days ago

INT8, not FP8
