
I wonder about that. I mean, attention is sort of obvious, isn't it? Just calculate across all the data available and look at every moment in time simultaneously. I know the point of attention is to select the correct data and process that, but in order to select the correct data we do a truly massive matrix multiplication. That this would work was, I believe, no mystery even in 1960. It just wasn't possible, and even in 2016 it was not really possible to train with such a principle outside of the FANGs.

(Not trying to say it wasn't an incredible accomplishment for the authors. There are quite a few details to get right in order to get to the "obvious" advance.)
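
To make the "massive matrix multiplication" concrete, here is a minimal sketch of single-head self-attention in plain Python. It assumes identity query/key/value projections (real models learn these), and the name `self_attention` is just illustrative. The thing to notice is the n x n weight matrix: every position gets scored against every other position.

```python
import math

def self_attention(x):
    """Toy single-head self-attention over a list of n vectors of length d.

    Assumption: queries, keys and values are all just x itself
    (no learned projections), to keep the sketch short.
    """
    n, d = len(x), len(x[0])
    weights, out = [], []
    for q in x:
        # Score this position against every position: n dot products per row,
        # so n * n dot products in total.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        # Numerically stable softmax over the row.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps]
        weights.append(row)
        # Output is the weighted average of all value vectors.
        out.append([sum(w * v[j] for w, v in zip(row, x)) for j in range(d)])
    return weights, out

weights, out = self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(len(weights), "x", len(weights[0]), "attention weights")
```

Doubling the sequence length quadruples the number of scores, which is where the cost comes from.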

Even today it's pretty obvious how such a thing might be further extended. Create a network with an input so big that it just contains its entire dataset and does attention across all of it. Such a network would only need a basic understanding of language and would not hallucinate. It'd also be obvious where anything came from, as the attention vectors would show what data was used by what specific part of the network. But this is a theoretical exercise, as all the compute power in the world can't do that for any decent-size dataset.

RNNs and LSTMs are more efficient in training and inference because they don't do this. They do compute for every token, and then more or less just add together, sequentially, every thought any part of the network had.
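
For contrast, a toy sketch of the recurrent approach: one update per token, folding everything seen so far into a fixed-size state. This is a scalar recurrence with made-up weights `w_in` and `w_state` (hypothetical; a real RNN or LSTM uses learned matrices and gates), so treat it as an illustration of the cost shape only.

```python
import math

def rnn_scan(tokens, w_in=0.5, w_state=0.9):
    """Toy scalar recurrence h_t = tanh(w_in * x_t + w_state * h_{t-1}).

    Assumption: the weights are arbitrary constants, not learned values.
    """
    h = 0.0      # fixed-size state, no matter how long the sequence is
    steps = 0
    for x in tokens:
        h = math.tanh(w_in * x + w_state * h)
        steps += 1   # exactly one update per token
    return h, steps

def attention_pair_count(n):
    """Number of position pairs full self-attention scores for n tokens."""
    return n * n

h, steps = rnn_scan([1.0] * 8)
print(steps, "recurrent updates vs", attention_pair_count(8), "attention scores")
```

The recurrent loop grows linearly with the number of tokens, while the attention pair count grows quadratically.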

We need to go the opposite direction from attention. It has to be the case that attention is extremely inefficient.