by thentherewere2 905 days ago
RNNs and LSTMs from the past did this as well, but they cannot be trained in parallel across the sequence, because each token has to be compressed into the hidden state sequentially. Transformers ate their lunch.
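
To make the sequential bottleneck concrete, here is a rough toy sketch in plain Python. It is not any real model: the scalar hidden state, the weights `w_h` and `w_x`, and the function names `rnn_compress` and `attention_mix` are all made up for illustration, and the attention is unmasked with no learned projections.

```python
import math

def rnn_compress(tokens, w_h=0.5, w_x=1.0):
    """Fold a sequence into a hidden state one token at a time (toy scalar RNN)."""
    h = 0.0
    states = []
    for x in tokens:
        # Step t needs h from step t-1, so these iterations cannot run in parallel.
        h = math.tanh(w_h * h + w_x * x)
        states.append(h)
    return states

def attention_mix(tokens):
    """Toy unmasked self-attention over scalars (hypothetical, no learned weights)."""
    outputs = []
    for q in tokens:
        # Each output reads only the raw inputs, never another output,
        # so every iteration of this outer loop is independent of the others.
        scores = [q * k for k in tokens]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        outputs.append(sum(w * v for w, v in zip(weights, tokens)) / total)
    return outputs
```

In `rnn_compress` the whole prefix is squeezed through `h`, which is the compression part. In `attention_mix` nothing is carried from one position to the next, which is why that loop could be computed for all positions at once during training.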

Newer methods are going back to similar concepts but trying to get past previous bottlenecks given what we've learned since then about transformers.