by zozbot234
271 days ago
My understanding is that RNNs and LSTMs (potentially augmented with some bespoke attention mechanism) have the potential to be more training- and inference-efficient than the transformer models that are more common today, but transformers were adopted because their training can be scaled up and parallelized in a way that just isn't feasible with the older models. So transformers can get better outcomes from the scale of compute they make usable, despite possibly being less compute-efficient overall.
(Not trying to say it wasn't an incredible accomplishment for the authors. There are quite a few details to get right in order to get to the "obvious" advance.)
Even today it's pretty obvious how such a thing might be further extended. Create a network with an input so big that it just contains, and does attention across, its entire dataset. Such a network would only need a basic understanding of language and would not hallucinate. It'd also be obvious where anything came from, as the attention vectors would show what data was used by what specific part of the network. But this is a theoretical exercise, as all the compute power in the world can't do that for any decent-sized dataset.
RNNs and LSTMs are more training- and inference-efficient because they don't do this. They do compute for every token and then, more or less, just add together every thought any part of the network had, sequentially.
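To put rough numbers on that contrast, here is a toy sketch in plain Python. It is not a real transformer or LSTM: the tokens are scalars, there are no learned weights, and the names `causal_attention` and `recurrent_scan` are made up for illustration. It only counts how much work each scheme does as the input grows, assuming causal attention where every token is scored against all earlier tokens.

```python
import math

def causal_attention(xs):
    """Toy causal self-attention over scalar tokens (no learned weights).

    Returns (outputs, number of token-pair scores computed)."""
    outputs, pair_scores = [], 0
    for i, q in enumerate(xs):
        context = xs[: i + 1]                 # token i sees tokens 0..i
        scores = [q * k for k in context]     # one score per (query, key) pair
        pair_scores += len(scores)
        m = max(scores)                       # subtract max for a stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        outputs.append(sum(w * v for w, v in zip(weights, context)) / total)
    return outputs, pair_scores

def recurrent_scan(xs, decay=0.5):
    """Toy recurrence: one fixed-size state, updated once per token.

    Returns (outputs, number of state updates)."""
    h, outputs, updates = 0.0, [], 0
    for x in xs:
        h = math.tanh(decay * h + x)          # everything so far is folded into h
        updates += 1
        outputs.append(h)
    return outputs, updates

for n in (8, 64, 512):
    tokens = [((i % 7) - 3) / 3 for i in range(n)]
    _, pairs = causal_attention(tokens)
    _, steps = recurrent_scan(tokens)
    print(f"n={n}: attention pair scores={pairs}, recurrent updates={steps}")
```

The pair count grows as n(n+1)/2 while the recurrent update count grows as n. The catch is that each recurrent update depends on the previous one, so the loop can't be spread across tokens the way the attention scores can, which is the parallelism trade-off mentioned upthread.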
We need to go the opposite direction from attention. It has to be the case that attention is extremely inefficient.