by 4death4 786 days ago
MLP is a universal approximator, so there’s definitely a configuration that can match an attention mechanism. Whether or not it’d be feasible to train is another question.
1 comment

Not sure about feasible, but it certainly wouldn't be efficient.

I think this MLP universal approximator notion is similar to a Turing machine being a universal computation device. Correct, but practically useless.

I don't think Sutton's bitter lesson is going to result in everything being an MLP. You want the most scalable architecture, which an MLP certainly is not.
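
To put rough numbers on the scalability point, here is a back-of-the-envelope sketch. It assumes single-head self-attention with biases ignored, and a one-hidden-layer MLP applied to the flattened sequence with hidden width equal to input width, so that every position can influence every other. The function names are made up for illustration, and other MLP layouts would give different counts.

```python
# Rough parameter counts, biases ignored. All names here are illustrative.

def attention_params(d_model: int) -> int:
    # Single-head self-attention: query, key, value and output projections,
    # each d_model x d_model. The count does not depend on sequence length.
    return 4 * d_model * d_model


def flattened_mlp_params(seq_len: int, d_model: int, hidden_mult: int = 1) -> int:
    # One-hidden-layer MLP over the flattened (seq_len * d_model) input,
    # projecting back to the same size. Assumed layout, not a standard one.
    width = seq_len * d_model
    hidden = hidden_mult * width
    return width * hidden + hidden * width


if __name__ == "__main__":
    d_model = 512
    for seq_len in (128, 1024, 8192):
        print(seq_len, attention_params(d_model), flattened_mlp_params(seq_len, d_model))
```

With `d_model = 512` and a sequence length of 128, the attention layer has about 1M weights while the flattened MLP has about 8.6B, and the MLP count grows quadratically with sequence length. The flattened MLP is also tied to one fixed sequence length, whereas the attention weights are reused at any length.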