by miven 890 days ago
Now that I think about it, doesn't this "technique" triple the amount of compute and memory per generated token since each model needs to also compute and store the KV values for the two previous tokens it didn't generate and thus has never seen?

Edit: On second thought, depending on how it's actually implemented, the other two tokens are probably run through the model in parallel, so it shouldn't be all that much slower.

2 comments

It doesn’t generate three responses for every turn. It randomly picks a model for every response, the claim being that the switching between different models leads to better conversations because of the diversity of each model’s training.
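For what it's worth, here is a minimal sketch of the per-response switching described above. The three `model_*` functions are hypothetical stand-ins I made up for illustration (not the actual models or any real API), and uniform random choice is an assumption about how "randomly picks" is done.

```python
import random

# Hypothetical stand-ins for the three small chat models. Each one takes
# the conversation so far and returns one complete reply.
def model_a(history):
    return f"[A] reply to turn {len(history)}"

def model_b(history):
    return f"[B] reply to turn {len(history)}"

def model_c(history):
    return f"[C] reply to turn {len(history)}"

MODELS = [model_a, model_b, model_c]

def blended_reply(history, rng):
    """Pick one model uniformly at random; it writes the whole reply."""
    model = rng.choice(MODELS)
    return model(history)

def chat(user_turns, seed=0):
    """Run a conversation where each bot turn may come from a different model."""
    rng = random.Random(seed)
    history = []
    for turn in user_turns:
        history.append(("user", turn))
        history.append(("bot", blended_reply(history, rng)))
    return history
```

Note that only one model runs per response, so nothing is generated three times; the shared conversation history is the only thing the models have in common.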
Correct me if I'm wrong, but usually when you do normal token-by-token inference in a transformer, you store the calculations made in previous steps in a KV cache so you can reuse them instead of calculating them all over again.

But here since the previous few tokens were produced by another model, the current model has never seen them and as such, by definition, doesn't have those calculations stored, but it still needs them to properly calculate attention for the new token.
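To make the concern concrete, here is a toy counting sketch under the token-by-token reading. `ToyModel` and `kv_work` are hypothetical names for illustration; no real attention is computed, it only counts how many per-token K/V entries each model has to produce. Round-robin switching is used instead of random choice so the count is deterministic.

```python
class ToyModel:
    """Toy stand-in for a transformer that only tracks its own KV cache."""

    def __init__(self, name):
        self.name = name
        self.kv_cache = []     # one (key, value) entry per token processed
        self.kv_computed = 0   # number of per-token K/V computations done

    def _kv(self, token):
        self.kv_computed += 1
        return (f"K_{self.name}({token})", f"V_{self.name}({token})")

    def extend_cache(self, context):
        """Compute K/V only for context tokens missing from this cache."""
        missing = context[len(self.kv_cache):]
        self.kv_cache.extend(self._kv(t) for t in missing)
        return len(missing)


def kv_work(num_models, num_tokens):
    """Total K/V computations when models take turns generating tokens."""
    models = [ToyModel(str(i)) for i in range(num_models)]
    context = []
    for step in range(num_tokens):
        model = models[step % num_models]  # round-robin, an assumption
        model.extend_cache(context)        # catch up on unseen tokens
        context.append(f"tok{step}")       # the newly generated token
    return sum(m.kv_computed for m in models)
```

With 9 generated tokens, `kv_work(1, 9)` gives 8 and `kv_work(3, 9)` gives 21, so the total K/V work approaches 3x as the sequence grows. The catch-up tokens in `extend_cache` are known in advance, which is why a real implementation could process them in one parallel pass rather than one at a time.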

It doesn’t appear to be token-by-token inference. Each new completion uses a different model, but the new completion is entirely created by that model.
It reads like that, yeah. Although 3 x 6B = 18B is still roughly an order of magnitude smaller than ChatGPT's purported 175B.
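A quick back-of-the-envelope check of that comparison, using only the figures quoted in this thread (the 175B number is the purported one, not a confirmed size):

```python
# Parameter counts as quoted in the thread; 175B is a purported figure.
blend_params = 3 * 6e9      # three 6B models combined
chatgpt_params = 175e9

ratio = chatgpt_params / blend_params
print(f"combined: {blend_params / 1e9:.0f}B, ratio: {ratio:.1f}x")
```

So the three models together hold about 18B parameters, a bit under a tenth of 175B.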