

	by modeless 807 days ago
	It's a start but it's disappointing that half the layers still have to process every token. It seems like we ought to be able to get to 90% or even 99% savings when these models currently allocate the same compute for outputting "the" as they do for outputting the first digit of the answer of a complicated math problem.


aiddun 807 days ago

Speculative decoding does this to an extent: a smaller draft model generates several tokens ahead, and the bigger model checks them all in a single batched pass, keeping the draft tokens up to the point where the two models diverge

https://huggingface.co/blog/whisper-speculative-decoding
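
For anyone who wants the loop spelled out, here is a minimal sketch of the greedy variant, assuming each "model" is just a function from a token prefix to the next token. The names `speculative_decode`, `greedy_decode`, and the parameter `k` are made up for illustration and are not the API of any library. A real implementation would verify all `k` guesses in one batched forward pass of the big model; the sketch loops over positions only to stay self-contained.

```python
from typing import Callable, List

# Assumption: a "model" is any function mapping a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def greedy_decode(target: Model, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: one target-model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       max_new_tokens: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: draft k tokens, keep them until the models diverge."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1. The small model guesses k tokens ahead, one at a time.
        guesses: List[int] = []
        for _ in range(k):
            guesses.append(draft(tokens + guesses))

        # 2. The big model checks every guessed position. In practice this is
        #    a single batched pass; it is a loop here for clarity.
        accepted: List[int] = []
        for i in range(k):
            expected = target(tokens + guesses[:i])
            if expected != guesses[i]:
                # Divergence: keep the big model's token and discard the rest.
                accepted.append(expected)
                break
            accepted.append(guesses[i])

        tokens.extend(accepted)
    return tokens[:goal]
```

Under these assumptions the output is identical to plain greedy decoding with the big model; the draft model only changes how many positions the big model gets to check per step.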


brrrrrm 807 days ago

It doesn’t. It simply trades extra compute for lower latency by pulling the big model’s matrix multiplications for “future” tokens forward into the current step. It doesn’t actually save FLOPs (it uses more), and it doesn’t work at large batch sizes


imtringued 807 days ago

>doesn’t actually save FLOPs (uses more)

Does anyone even care? Really, who cares? The truth is nobody cares. Saving FLOPs does nothing if you have to load the entire model from memory anyway. Going from two FLOPs per parameter to 0.5 or whatever might sound cool on paper, but you're loading those parameters anyway and have gained nothing.


brrrrrm 803 days ago

Companies that run these things care: they run at huge batch sizes and are compute-bound in the limit
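
The disagreement above comes down to batch size, which a back-of-the-envelope roofline estimate makes concrete. This is a rough sketch, assuming one decoding step costs about 2 FLOPs per parameter per sequence in the batch and reads every weight from memory once per step, and ignoring activations, attention, and the KV cache. The function names are hypothetical, and the hardware figures in the usage note are round illustrative numbers, not the specs of any particular accelerator.

```python
def step_times(params: float, batch: int, bytes_per_param: float,
               peak_flops: float, mem_bandwidth: float) -> tuple:
    """Return (compute_seconds, memory_seconds) for one decoding step."""
    # Assumption: ~2 FLOPs per parameter for each sequence in the batch.
    compute_s = 2 * params * batch / peak_flops
    # Assumption: weights are read once per step, regardless of batch size.
    memory_s = params * bytes_per_param / mem_bandwidth
    return compute_s, memory_s


def bottleneck(params: float, batch: int, bytes_per_param: float,
               peak_flops: float, mem_bandwidth: float) -> str:
    """Say which resource limits the step under this simplified model."""
    compute_s, memory_s = step_times(params, batch, bytes_per_param,
                                     peak_flops, mem_bandwidth)
    return "compute" if compute_s > memory_s else "memory"


def critical_batch(bytes_per_param: float, peak_flops: float,
                   mem_bandwidth: float) -> float:
    """Batch size at which compute time equals weight-loading time."""
    return peak_flops * bytes_per_param / (2 * mem_bandwidth)
```

With made-up figures of 1e15 FLOP/s, 2e12 bytes/s, and 2 bytes per parameter, `critical_batch(2, 1e15, 2e12)` comes out to 500. Below that batch size, loading weights dominates and saved FLOPs buy little; well above it, the step is compute-bound and FLOPs are what you pay for.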
