by jacobsimon
775 days ago
But is this lookup mechanism available from one token prediction to the next? I’ve heard conflicting things, with others saying that transformers are stateless and therefore don’t share this information across prediction steps. I might be misunderstanding something fundamental.
So if you wished, you could implement a transformer by recomputing everything for every token. That would be incredibly inefficient, which is why implementations normally keep the per-token state around between prediction steps. However, if you're continuing a conversation with an LLM, you would likely recompute the state for all tokens on each new user input, because the alternative is to keep all that state in memory until the user gets back to you, perhaps a minute later. If you have too many simultaneous users, you won't have enough VRAM for that. (In some cases, moving it out of VRAM temporarily might be practical.)
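To make the trade-off concrete, here is a rough sketch of a single attention head in plain Python. It is only an illustration: the 2-dimensional weights, the `CachedAttention` class, and the model dimensions passed to `kv_cache_bytes` are all made-up assumptions, not taken from any real model or library. It shows that recomputing keys and values from scratch gives the same output as keeping them in a cache, and gives a back-of-the-envelope size for that cache.

```python
import math

# Toy projection matrices (hypothetical values, not from a real model).
W_Q = [[0.5, -0.2], [0.1, 0.3]]
W_K = [[0.4, 0.1], [-0.3, 0.2]]
W_V = [[0.2, 0.6], [0.7, -0.1]]

def matvec(W, x):
    """Multiply matrix W by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Softmax-weighted average of values, scored by scaled q . k."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]

def last_output_stateless(embeddings):
    """Stateless version: recompute keys and values for every token so far."""
    keys = [matvec(W_K, x) for x in embeddings]
    values = [matvec(W_V, x) for x in embeddings]
    return attend(matvec(W_Q, embeddings[-1]), keys, values)

class CachedAttention:
    """Cached version: keep keys and values between prediction steps."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x):
        # Only the new token is projected; older keys/values are reused.
        self.keys.append(matvec(W_K, x))
        self.values.append(matvec(W_V, x))
        return attend(matvec(W_Q, x), self.keys, self.values)

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    """Rough cache size: one key and one value vector per layer, head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value

embeddings = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 2.0]]
cached = CachedAttention()
for t in range(1, len(embeddings) + 1):
    recomputed = last_output_stateless(embeddings[:t])
    from_cache = cached.step(embeddings[t - 1])
    assert all(math.isclose(a, b) for a, b in zip(recomputed, from_cache))

# Assumed dimensions: 32 layers, 32 heads, head size 128, 4096 tokens, 16-bit values.
print(kv_cache_bytes(32, 32, 128, 4096) / 2**30)  # 2.0 (GiB per conversation)
```

The outputs match at every step, so the cache is purely an optimization: the model itself has no hidden state beyond what can be rebuilt from the tokens. Under these assumed dimensions the cache comes to about 2 GiB for one 4096-token conversation, which is why holding it for many idle users gets expensive.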