Hacker News
by ActorNightly 894 days ago
>What's special about transformers is they allow each element in a sequence to decide which parts of data are most important to it from each other element in the sequence, then extract those out and compute on them.

They do that in theory. In practice, it's all just matrix multiplication. You could easily structure a transformer as a stack of deep fully connected layers and it would be mathematically equivalent, just computationally inefficient.
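To make the "matrix multiplication" point concrete, here is a rough sketch of single-head self-attention in plain Python. Everything in it is an assumption for illustration: the toy sizes, the random weights, and the helper names (`matmul`, `softmax_rows`, `self_attention`, `block_diagonal`) are hypothetical and not taken from any library. It only checks the part that is easy to verify: a per-token projection gives the same numbers as one big fully connected layer applied to the flattened sequence, at the cost of a much larger, mostly zero weight matrix. It does not try to settle the full equivalence claim, since the attention weights are themselves computed from the input through a softmax.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def self_attention(x, w_q, w_k, w_v):
    """Single-head attention: softmax(Q K^T / sqrt(d)) V."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(w_q[0]))
    scores = [[s / scale for s in row] for row in matmul(q, transpose(k))]
    weights = softmax_rows(scores)  # depends on the input, not a fixed weight matrix
    return matmul(weights, v), weights


def block_diagonal(w, n):
    """Dense (n*d_in) x (n*d_out) matrix with n copies of w on the diagonal."""
    d_in, d_out = len(w), len(w[0])
    big = [[0.0] * (n * d_out) for _ in range(n * d_in)]
    for t in range(n):
        for i in range(d_in):
            for j in range(d_out):
                big[t * d_in + i][t * d_out + j] = w[i][j]
    return big


def rand_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
n_tokens, d_model = 3, 4  # toy sizes, chosen arbitrarily
x = rand_matrix(n_tokens, d_model)
w_q, w_k, w_v = (rand_matrix(d_model, d_model) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)

# Per-token projection versus one fully connected layer on the flattened sequence.
flat_x = [[v for row in x for v in row]]  # shape 1 x (n_tokens * d_model)
dense_q = matmul(flat_x, block_diagonal(w_q, n_tokens))[0]
per_token_q = [v for row in matmul(x, w_q) for v in row]
max_diff = max(abs(a - b) for a, b in zip(dense_q, per_token_q))

shared_params = d_model * d_model
dense_params = (n_tokens * d_model) ** 2
print("max difference between the two projections:", max_diff)
print("weights stored: shared =", shared_params, "dense =", dense_params)
```

The dense version stores one weight per pair of flattened input and output positions, so its size grows with the square of the sequence length, while the shared projection stays the same size no matter how many tokens there are. That is the "computationally inefficient" part.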