Hacker News new | ask | show | jobs
by rar00 353 days ago
This argument works better for state space models. A transformer would still steps context one token at a time, not maintain an internal 1e18 state.
1 comments

That doesn't matter, are you familiar with any theoretical results in which the computation is somehow limited in ways that practically matter when the context length is very long? I am not