Hacker News new | ask | show | jobs
by evnc 916 days ago
This is fair -- the newest token can attend perfectly to the oldest token, within the context window.

but also, on a broader scale, if a transformer model is presented with a long input that does not fit in its context (e.g.: you are building a chatbot, and you have a very long chat history), it must "compress" or "forget" some of that information (e.g.: repeatedly summarizing historical messages, dropping them and appending the summary at the beginning of the input).

Mamba/RWKV/other "recurrent" architectures, can theoretically operate on unbounded input lengths; they "forget" information from earlier tokens over time, but is that not comparable to what a transformer must do with input lengths greater than their context window?