I wonder if it is because, with backpropagation, all the non-linear functions are chained together when the weights are learnt? Is it naive to think that, under that formulation, the results will be quite similar since the final model equations are close?
Excellent write-up. I've coincidentally been experimenting with the same thing. Any idea whether this approach could be used to speed up a search for exact solutions?
Indeed, a CA can be represented by simple combinations of boolean functions, and obviously by an NN as well, which is itself a combination of similar nonlinear functions.
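To make that concrete, here is a minimal sketch for Conway's Game of Life, written once as a boolean expression and once as a tiny threshold network. The function names are made up for illustration, the weights are set by hand rather than learnt, and a step activation stands in for whatever smooth nonlinearity a trained network would use, so this is not the method from the linked post.

```python
from itertools import product


def life_rule_boolean(alive, neighbors):
    """Conway's rule as a boolean expression over the live-neighbor count."""
    n = sum(neighbors)
    return int(n == 3 or (alive == 1 and n == 2))


def step(x):
    """Heaviside step activation (assumed here in place of a smooth one)."""
    return 1 if x > 0 else 0


def life_rule_network(alive, neighbors):
    """The same rule as a threshold network with two hidden units.

    Linear layer: weight 2 on each neighbor, weight 1 on the centre cell.
    The cell is alive next step exactly when s is 5, 6 or 7.
    """
    s = 2 * sum(neighbors) + alive
    h1 = step(s - 4.5)  # fires when s >= 5
    h2 = step(s - 7.5)  # fires when s >= 8
    return h1 - h2      # output layer: weights +1 and -1


def rules_agree():
    """Check both formulations on all 2 * 2**8 = 512 possible inputs."""
    for alive in (0, 1):
        for neighbors in product((0, 1), repeat=8):
            if life_rule_boolean(alive, neighbors) != life_rule_network(alive, neighbors):
                return False
    return True
```

Calling `rules_agree()` compares the two versions exhaustively, which is cheap because a cell and its eight neighbors only have 512 configurations.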
What do you mean by 'the best'? Deeper architectures are popular because they are quite easy to train. They do work well in practice on many tasks (especially vision) but they have their limits.
Infinitely wide networks are a newly active field and have recently shown some promising results, both theoretically [1, 2] and empirically [3].
https://hardmath123.github.io/conways-gradient.html