by evnc
916 days ago
This is fair -- the newest token can attend perfectly to the oldest token, as long as both are within the context window. But on a broader scale, if a transformer model is presented with an input too long to fit in its context (e.g., you are building a chatbot and have a very long chat history), it must "compress" or "forget" some of that information (e.g., by repeatedly summarizing older messages, dropping them, and prepending the summary to the input). Mamba, RWKV, and other "recurrent" architectures can theoretically operate on unbounded input lengths; they "forget" information from earlier tokens over time, but is that not comparable to what a transformer must do with inputs longer than its context window?
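To make the chatbot case concrete, here is a rough sketch of the "summarize, drop, and put the summary at the front" loop. Everything in it is an assumption for illustration: `count_tokens` is a crude whitespace counter rather than a real tokenizer, `toy_summarize` is a hypothetical stand-in for an LLM summarization call (it just keeps the first few words), and `fit_to_context` is a made-up helper name, not any library's API.

```python
from typing import Callable, List

SUMMARY_TAG = "[summary]"


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for real tokens.
    return len(text.split())


def toy_summarize(texts: List[str], max_words: int = 8) -> str:
    # Hypothetical placeholder for an LLM summarization call.
    # It is lossy on purpose: only the first few words survive.
    words = [w for w in " ".join(texts).split() if w != SUMMARY_TAG]
    return " ".join([SUMMARY_TAG] + words[:max_words])


def fit_to_context(
    history: List[str],
    budget: int,
    summarize: Callable[[List[str]], str] = toy_summarize,
) -> List[str]:
    """Fold the oldest messages into one summary until the total fits."""
    messages = list(history)
    while len(messages) > 1 and sum(count_tokens(m) for m in messages) > budget:
        # Merge the two oldest entries; the result stays at the front.
        messages = [summarize(messages[:2])] + messages[2:]
    return messages


if __name__ == "__main__":
    chat = ["a b c d e f g h i j"] * 5  # 5 messages, 10 "tokens" each
    for line in fit_to_context(chat, budget=30):
        print(line)
```

The point of the sketch is that the front-of-context summary behaves a lot like a fixed-size state: each time it absorbs an older message, some detail is lost, and recent messages stay verbatim. Note that the loop stops at a single message, so a budget smaller than one summary will still be exceeded.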