by spyder
267 days ago
This is false: "that's because a next token predictor can't "forget" context. That's just not how it works." An LSTM is also a next-token predictor and literally has a forget gate, and there are many other context-compressing models that remember only what they judge to be important and forget the rest, for example state-space models or RWKV, which work well as LLMs too.
But even a basic GPT model forgets old context, since the context gets truncated if it cannot fit, though that's not really the learned, smart forgetting the other models do.
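
To make the distinction concrete, here is a rough sketch of the two kinds of "forgetting" mentioned above. It is a toy illustration, not how any particular model is implemented: the function names are made up, the LSTM update is reduced to a single scalar cell, and the gate logits are passed in by hand, whereas in a real LSTM they are learned functions of the input and the previous hidden state.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_cell_state_update(c_prev, forget_logit, input_logit, candidate):
    """One scalar LSTM cell-state update: c_t = f * c_prev + i * tanh(candidate).

    Hypothetical helper for illustration. The logits are hand-picked here;
    a real LSTM computes them from learned weights.
    """
    f = sigmoid(forget_logit)  # forget gate: how much old memory to keep
    i = sigmoid(input_logit)   # input gate: how much new information to write
    return f * c_prev + i * math.tanh(candidate)


def truncate_context(tokens, max_len):
    """Fixed-window 'forgetting': keep only the most recent max_len tokens."""
    return tokens[-max_len:] if max_len > 0 else []


# Forget gate driven towards 0: the old memory is almost entirely erased.
erased = lstm_cell_state_update(1.0, forget_logit=-20.0, input_logit=-20.0, candidate=0.0)

# Forget gate driven towards 1: the old memory is carried forward nearly intact.
kept = lstm_cell_state_update(1.0, forget_logit=20.0, input_logit=-20.0, candidate=0.0)

# Truncation drops the oldest tokens regardless of how important they were.
window = truncate_context(["a", "b", "c", "d", "e"], max_len=3)

print(erased, kept, window)
```

The point of the contrast: the gate decides per step what to drop based on content, while truncation drops whatever is oldest.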