by thomasahle 809 days ago
>> But Transformers have one core problem. In a transformer, every token can look back at every previous token when making predictions.

> Lately I've been wondering... is this a problem, or a strength?

Exactly. There are a lot of use cases where perfect recall is important. And earlier data may be more or less incompressible, for example when an LLM is working on a large table of data.
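To make the tradeoff concrete, here's a toy sketch (not a real model). The function names and the hash-style state update are made up for illustration. It only contrasts a memory that keeps every token, roughly like a KV cache, with a fixed-size state that every token gets folded into.

```python
def attention_memory(tokens):
    """Keep every token, roughly like a KV cache: exact recall, memory grows with length."""
    return list(tokens)


def rnn_memory(tokens, state_size=4):
    """Fold every token into a fixed-size state: constant memory, lossy recall.

    The update rule is an arbitrary hash-like mix (an assumption for this sketch),
    standing in for a learned recurrent update.
    """
    state = [0] * state_size
    for i, tok in enumerate(tokens):
        slot = i % state_size
        state[slot] = (state[slot] * 31 + tok) % 1_000_003
    return state


# Stand-in for a large table whose cells can't be compressed.
table = list(range(100, 120))

cache = attention_memory(table)  # one entry per token
state = rnn_memory(table)        # always state_size entries

# Exact lookup of the 8th cell is trivial with the cache.
eighth_cell = cache[7]
```

With the cache, any earlier cell can be read back exactly. With the fixed-size state, 20 cells are squeezed into 4 slots, so for incompressible data something has to be lost.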

Maybe we'll end up with different architectures for different applications. E.g. simple chat may be OK with an RNN-type architecture.

I've also seen people combine Mamba and Transformer layers. Maybe that's a good tradeoff for some other applications.