by breckenedge
890 days ago
It doesn’t generate three responses for every turn. It randomly picks one model for each response; the claim is that switching between different models leads to better conversations because of the diversity of each model’s training.

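To make the per-response selection concrete, here is a minimal sketch of how I read it. The model names and the `run_conversation` helper are made up for illustration, and uniform random choice is my assumption; the real system may weight or pick models differently.

```python
import random

# Hypothetical model names, purely for illustration.
MODELS = ["model_a", "model_b", "model_c"]


def pick_model(rng):
    """Pick exactly one model for the next response (uniform choice is an assumption)."""
    return rng.choice(MODELS)


def run_conversation(user_turns, rng):
    """Return (turn, chosen_model) pairs: one model per response, not three responses per turn."""
    return [(turn, pick_model(rng)) for turn in user_turns]


if __name__ == "__main__":
    for turn, model in run_conversation(["hi", "how are you?", "bye"], random.Random()):
        print(f"{turn!r} -> answered by {model}")
```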
But here, since the previous few tokens were produced by another model, the current model has never processed them and so, by definition, doesn't have those calculations cached. It still needs them to compute attention for the new token, so it has to run over those tokens itself first.
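A toy sketch of that bookkeeping, under loud assumptions: `ToyModel` is hypothetical, and it only counts how many context tokens each model already has cached keys/values for; it does no real attention. The one thing it relies on is that a cache built by one model can't be reused by another, because the cached values depend on that model's weights.

```python
class ToyModel:
    """Hypothetical stand-in for a model with its own per-conversation KV cache."""

    def __init__(self, name):
        self.name = name
        self.cached = 0  # number of context tokens this model has cached entries for

    def generate(self, context, n_new):
        """Pretend to generate n_new tokens; return how many tokens had to be prefilled."""
        # Tokens in the context that this model never processed have no cache entries,
        # so they must be run through the model before it can decode anything new.
        prefill = len(context) - self.cached
        # Simplification: treat the prior context and the newly generated tokens as cached.
        self.cached = len(context) + n_new
        return prefill


if __name__ == "__main__":
    a, b = ToyModel("a"), ToyModel("b")
    context = ["tok"] * 10  # a 10-token prompt
    for model in (a, b, a):
        prefilled = model.generate(context, 5)
        context += ["tok"] * 5  # the 5 tokens that model just produced
        print(f"model {model.name}: prefilled {prefilled} tokens before generating")
```

In this toy run, model a returns to the conversation after b has spoken and has to prefill only the 5 tokens b wrote, while b, on its first turn, has to prefill the entire context so far.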