by miven
890 days ago
Now that I think about it, doesn't this "technique" triple the compute and memory per generated token, since each model also has to compute and store the KV values for the two previous tokens it didn't generate and so has never seen? Edit: On second thought, depending on how it's actually implemented, the other two tokens are probably run through the model in parallel, so it shouldn't be all that much slower.
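To put rough numbers on that, here's a back-of-the-envelope sketch. It assumes the models take turns round-robin, each keeps its own KV cache, and a model catches up on the tokens it missed in the same forward call that produces its next token. The function name `round_robin_cost` and the one-token prompt are made up for illustration; none of this is taken from the actual implementation.

```python
def round_robin_cost(num_models: int, num_tokens: int, prompt_len: int = 1):
    """Count work when `num_models` models take turns generating tokens.

    Returns (token_passes, forward_calls, kv_entries):
      token_passes  - total tokens pushed through any model (compute proxy)
      forward_calls - sequential forward calls (latency proxy)
      kv_entries    - total cached token positions across all models (memory proxy)
    """
    cached = [0] * num_models  # positions each model already holds in its KV cache
    token_passes = 0
    forward_calls = 0
    for i in range(num_tokens):
        model = i % num_models   # whose turn it is
        pos = prompt_len + i     # position of the token about to be generated
        # The model must process every position before `pos` that it has not
        # cached yet; we assume this happens in one batched forward call.
        missing = pos - cached[model]
        token_passes += missing
        forward_calls += 1
        cached[model] = pos
    return token_passes, forward_calls, sum(cached)


if __name__ == "__main__":
    for n in (1, 3):
        passes, calls, kv = round_robin_cost(n, 300)
        print(f"{n} model(s): token passes={passes}, forward calls={calls}, KV entries={kv}")
```

Under these assumptions the token passes and total KV entries come out close to 3x the single-model case, while the number of sequential forward calls stays the same, which is why the slowdown should be smaller than the raw compute figure suggests.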