by theanonymousone
618 days ago
I remember that, the way I understood it, Transformers solved two major "issues" of RNNs, and that this enabled the later boom: vanishing gradients, which limited the context (and model?) size, and difficulty in parallelisation, which limited the size of the training data. Do we have solutions for these two problems now?

RNNs are constantly updating and overwriting their memory. This means they need to predict what is going to be useful in order to store it for later.
Not having to make that prediction is a massive advantage for Transformers in interactive use cases like ChatGPT. You give it context and ask questions over multiple turns. Which part of the context was important for a given question only becomes known later in the token sequence.
To be more precise, I should say it's an advantage of attention-based models, because there are also hybrid models that successfully mix both approaches, like Jamba.
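
To make the contrast concrete, here is a toy sketch in plain Python. Everything in it is an assumption made for illustration: the names `rnn_final_state` and `attention_readout`, the fixed `decay` of 0.5, the `scale` of 10, and the one-hot "facts". A real RNN learns what to keep through trained gates instead of a fixed decay, and real attention uses learned projections, so treat this as a cartoon of the memory behaviour, not a model of either architecture.

```python
import math

def rnn_final_state(tokens, decay=0.5):
    """Toy recurrent update: one fixed-size state, overwritten at every step."""
    state = [0.0] * len(tokens[0])
    for x in tokens:
        # Old content is scaled down to make room for the new token.
        state = [decay * s + (1.0 - decay) * xi for s, xi in zip(state, x)]
    return state

def attention_readout(tokens, query, scale=10.0):
    """Toy softmax attention: every token is kept, the query picks among them."""
    scores = [scale * sum(q * k for q, k in zip(query, tok)) for tok in tokens]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(query)
    return [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(dim)]

# Four one-hot "facts". The question about fact 0 only arrives after all of them.
context = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
question = [1.0, 0.0, 0.0, 0.0]

rnn_state = rnn_final_state(context)            # built without seeing the question
attn_out = attention_readout(context, question)  # built with the question in hand

print("RNN state:       ", rnn_state)
print("Attention output:", attn_out)
```

In the recurrent version the earliest fact has the smallest share of the final state, because the state was compressed before the question was known. The attention version still has every token available when the question arrives, so it can put nearly all of its weight on the first one.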