by yshui 534 days ago
Any autoregressive model can do what you are describing. Transformers generate one token at a time too, not all at once.
2 comments

True, but a transformer's memory requirements grow with sequence length, whereas for recurrent models the memory requirement is constant. This is why I qualified it with "low memory".
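
To make the memory point concrete, here is a rough toy sketch. It is not any real model: it only counts stored floats, assuming a transformer keeps one key/value pair per generated token in a cache while a recurrent model overwrites a fixed-size state. `D` and both function names are made up for illustration.

```python
# Toy illustration only: counts stored floats, does no real attention or SSM math.
D = 4  # hypothetical hidden size


def transformer_cache_sizes(num_tokens, d=D):
    """Floats held in a (single-layer, assumed) KV cache after each token."""
    kv_cache = []  # one (key, value) pair is appended per generated token
    sizes = []
    for _ in range(num_tokens):
        kv_cache.append(([0.0] * d, [0.0] * d))
        sizes.append(sum(len(k) + len(v) for k, v in kv_cache))
    return sizes


def recurrent_state_sizes(num_tokens, d=D):
    """Floats held by a recurrent model: the state is overwritten each step."""
    state = [0.0] * d
    sizes = []
    for _ in range(num_tokens):
        state = [0.5 * s + 1.0 for s in state]  # placeholder update rule
        sizes.append(len(state))
    return sizes


print(transformer_cache_sizes(5))  # [8, 16, 24, 32, 40]
print(recurrent_state_sizes(5))    # [4, 4, 4, 4, 4]
```

The first list grows linearly with the number of tokens; the second stays flat no matter how long the sequence gets.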
Yes, but transformers are much slower than state space models.