by ActorNightly 5 days ago
>I can't help but think that there's got to be a better mechanism

There is.

Transformers are basically autoencoders on the decode step - they take a compressed set of information and expand it into three matrices, which then get combined back into one matrix.

You can unroll the entire self-attention step into fully connected layers, just with a lot of zeros for the things that don't get multiplied together.
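To make the "lots of zeros" point concrete, here is a minimal sketch for just one piece of the step: a per-token projection Q = X @ W_q rewritten as a single fully connected layer over the flattened sequence. The sizes, the values, and the names `matmul`, `block_diagonal` and `w_q` are made up for illustration. It only covers the linear projection. The score product, where activations get multiplied by each other, is not shown here.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def block_diagonal(w, n_tokens):
    """Repeat w (d x k) along the diagonal n_tokens times, zeros elsewhere."""
    d, k = len(w), len(w[0])
    big = [[0.0] * (n_tokens * k) for _ in range(n_tokens * d)]
    for t in range(n_tokens):
        for i in range(d):
            for j in range(k):
                big[t * d + i][t * k + j] = w[i][j]
    return big


# Hypothetical toy sizes: 3 tokens, model width 2, projection width 2.
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
w_q = [[0.5, -1.0], [2.0, 0.25]]

# The usual per-token projection.
q = matmul(x, w_q)

# The same computation as one dense layer on the flattened sequence.
x_flat = [[v for row in x for v in row]]   # shape 1 x (n_tokens * d)
w_big = block_diagonal(w_q, len(x))        # shape (n_tokens * d) x (n_tokens * k)
q_flat = matmul(x_flat, w_big)[0]

# Both routes give the same numbers.
assert q_flat == [v for row in q for v in row]

# Count how much of the unrolled weight matrix is structural zeros.
zeros = sum(v == 0.0 for row in w_big for v in row)
total = len(w_big) * len(w_big[0])
print(f"{zeros} of {total} unrolled weights are zero")
```

The zero fraction grows with sequence length, since each token only ever touches its own block of the big matrix.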

So it stands to reason that there is probably an optimal form of weights that does the same thing as current transformers.