|
|
|
|
|
by joefourier
400 days ago
|
|
While true in a theoretical sense (an MLP of sufficient size can theoretically represent any differentiable function), in practice it’s often the case that it’s impossible for a certain architecture to learn a specific task no matter how much compute you throw at it. E.g. an LSTM will never capture long range dependencies that a transformer could trivially learn, due to gradients vanishing after a certain sequence length. |
|
However, for example, a Transformer can be represented with just deeply connected layers, albeit with a lot of zeros for weights.