by theonlybutlet
900 days ago
Thanks for your reply. You raise a very good point: transformer models are a lot more complex. I'd argue that conceptually they're the same, just with the data and the process more abstracted. Autoencoded data implies using efficient representations (basically, semantically abstracted data) and opting for measures like backpropagation through time.
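To make "backpropagation through time" concrete, here is a minimal sketch on a toy scalar recurrence, not on a transformer or any real model. The function name `bptt_scalar_rnn`, the recurrence h_t = w * h_{t-1} + x_t, and the squared-error loss on the final state are all assumptions picked for illustration.

```python
def bptt_scalar_rnn(w, x, target):
    """Run h_t = w * h_{t-1} + x_t from h_0 = 0, then backpropagate
    the squared error on the final state to get dLoss/dw.

    Hypothetical toy model, chosen only to illustrate the technique.
    """
    # Forward pass: keep every hidden state, the backward pass needs them.
    hs = [0.0]
    for x_t in x:
        hs.append(w * hs[-1] + x_t)
    loss = 0.5 * (hs[-1] - target) ** 2

    # Backward pass: walk the time steps in reverse, carrying dLoss/dh_t.
    grad_h = hs[-1] - target
    grad_w = 0.0
    for t in range(len(x), 0, -1):
        grad_w += grad_h * hs[t - 1]  # direct effect of w at step t
        grad_h *= w                   # push the error back to h_{t-1}
    return loss, grad_w
```

Note that the error is multiplied by w once per time step on the way back, so with |w| > 1 it grows with sequence length, which is the usual picture behind gradients becoming too large.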
"For example, when doing the backpropagation (the technique through which the models learn), the gradients can become too large"
But I think this is more of a borrowed term: it isn't used again in the description, and it may just be a misconception. The original paper doesn't use the term backprop, nor does it describe any stage of learning where output errors are run through the whole network in a deep regression.
What I do see in Transformers is localized use of gradient descent, and backprop in NNs also uses GD... but that seems to be the extent of it.
Is there a deep regression? Maybe I'm missing it.
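For reference, this is roughly what "output errors run through the whole network" looks like in the plain NN case. It is only a sketch on a toy stack of scalar tanh layers, not a transformer; the names `forward` and `backprop`, the layer form a_k = tanh(w_k * a_{k-1}), and the squared-error loss are assumptions made up for illustration.

```python
import math


def forward(weights, x):
    """Apply a_k = tanh(w_k * a_{k-1}) per layer; return all activations."""
    acts = [x]
    for w in weights:
        acts.append(math.tanh(w * acts[-1]))
    return acts


def backprop(weights, x, target):
    """Return the loss and dLoss/dw_k for every layer from one output error."""
    acts = forward(weights, x)
    loss = 0.5 * (acts[-1] - target) ** 2
    grads = [0.0] * len(weights)
    delta = acts[-1] - target  # error at the output
    # Walk the layers from last to first, applying the chain rule at each one.
    for k in range(len(weights) - 1, -1, -1):
        local = 1.0 - acts[k + 1] ** 2      # derivative of tanh at this layer
        grads[k] = delta * local * acts[k]  # gradient for this layer's weight
        delta = delta * local * weights[k]  # error passed to the layer below
    return loss, grads
```

A single error at the output ends up producing a gradient for every layer's weight, including the first. Gradient descent is then just the update rule applied to those gradients.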