Hacker News new | ask | show | jobs
by spyder 267 days ago
This is false:

"that's because a next token predictor can't "forget" context. That's just not how it works."

An LSTM is also a next token predictor and literally have a forget gate, and there are many other context compressing models too which remember only the what it thinks is important and forgets the less important, like for example: state-space models or RWKV that work well as LLMs too. But even just a the basic GPT model forgets old context since it's gets truncated if it cannot fit, but that's not really the learned smart forgetting the other models do.