Hacker News
by avereveard 911 days ago
I wonder if this could also be useful in reverse: you'd have a large, expensive LLM producing a few tokens per sentence about the answer, then an expansion LLM forming sentences out of them.
3 comments

Some teams have researched ways to do this.

For instance, you can have a smaller model generate ten tokens in sequence, and then ask the larger model "given the first N tokens, what is token N+1?" for each of the ten positions in parallel.

If the large and small models agree on, say, the first 7 tokens, you keep those, throw the remaining 3 away, and start over from there. You still have to run the large model for every token, but you can at least batch the computation, which is a lot more efficient because loading layer weights is the bottleneck, not the matrix ops.
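To make the draft-then-verify loop above concrete, here is a minimal sketch of the idea (usually called speculative decoding) with greedy decoding only. Everything in it is a stand-in: `draft` and `target` are hypothetical toy functions that map a token prefix to the next token, not real models, and the "parallel" verification is written as a plain loop where a real system would use one batched forward pass. One detail goes beyond what the comment describes: at the first disagreement the sketch keeps the large model's own token, since it has already been computed.

```python
from typing import Callable, List

Token = int
# Hypothetical stand-in for a model: maps a token prefix to the next token (greedy).
Model = Callable[[List[Token]], Token]


def speculative_step(prefix: List[Token], draft: Model, target: Model, k: int = 10) -> List[Token]:
    """Propose k tokens with the draft model, keep what the target model agrees with."""
    # 1. The small model generates k tokens one after another.
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The large model predicts the next token at each of the k positions.
    #    A real system would do this as a single batched pass; this is a loop.
    verified = [target(prefix + proposed[:i]) for i in range(k)]

    # 3. Keep the longest run of positions where both models agree.
    n_agree = 0
    while n_agree < k and proposed[n_agree] == verified[n_agree]:
        n_agree += 1
    accepted = proposed[:n_agree]

    # 4. At the first mismatch, the large model's token is already known, so use it.
    if n_agree < k:
        accepted.append(verified[n_agree])
    return accepted


def generate(prefix: List[Token], draft: Model, target: Model, n_tokens: int, k: int = 10) -> List[Token]:
    """Repeat speculative steps until n_tokens new tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < n_tokens:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(prefix) + n_tokens]


# Toy models (assumptions for illustration): the draft matches the target
# except at every position where the prefix length is 7 mod 8.
def toy_target(p: List[Token]) -> Token:
    return (len(p) * 3) % 7


def toy_draft(p: List[Token]) -> Token:
    return toy_target(p) if len(p) % 8 != 7 else (toy_target(p) + 1) % 7


tokens = generate([], toy_draft, toy_target, n_tokens=30)
```

With greedy decoding, the output is the same as running the large model alone; the draft model only changes how many large-model calls can be batched together per step.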

The expansion LLM would need a pretty good model of language, so it would likely have to be in the 7B range. Still, this could be useful, given that we're almost at the point where 7B models can run on most consumer hardware.
The text-to-image community has upscalers like this… I wonder if the same idea would be useful here.