|
|
|
|
|
by PoignardAzur
909 days ago
|
|
Some teams have researched ways to do this. For instance, you can have a smaller model generate ten tokens in sequence, and then ask the larger mode "given these N tokens, what is the token N+1" ten times in parallel. If the large and small model agree on, say, the first 7 tokens, then you keep these and throw the next 3 away and start over. So you still have to run the large model for each token, but you can at least do batch calculations (which is a lot more efficient, because loading layer weights is the bottleneck, not matrix ops). |
|