| Some of the topics in the parent post should not be a major surprise to anyone who has read https://people.math.harvard.edu/~ctm/home/text/others/shanno... ! If we have not read the foundations of the field we are in, we are doomed to be mystified by unexplained phenomena that arise quite naturally as consequences of already-distilled work! That said, on a first, cursory pass the experiments seem very thorough, and I appreciate the amount of detail that seems to have gone into them. The tradeoff between learning existing theory and attempting to re-derive it from scratch is, I think, a hard one: not having the traditional foundation allows for the discovery of new things, but having it allows for a deeper understanding of certain phenomena. I've seen several people here in the comments seemingly shocked that a model that maximizes the log likelihood of a sequence given the data somehow does not magically deviate from that behavior when run in inference. It's a density estimation model; do you want it to magically recite Shakespeare from the void? Please! Let's stick to the basics. It will help experiments like this make much more sense, as there already is a very clear mathematical foundation which clearly explains it (and said emergent phenomena). If you want more specifics, there are several layers. Shannon's treatment of ergodic systems is a good start (there is some minor deviation from that here, but it is likely a close enough match to be properly instructive about the general dynamics of what is going on). |
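To make the density-estimation point concrete, here is a toy sketch (mine, not from the parent post): a character-level bigram model fit by maximum likelihood. The corpus and the names `fit_bigram`, `log_likelihood`, and `sample` are made up for illustration. The fitted conditionals are just normalized counts, and sampling can only ever emit transitions that appeared in the training text.

```python
import math
import random
from collections import Counter, defaultdict


def fit_bigram(text):
    """Maximum-likelihood bigram model: P(b | a) = count(a, b) / count(a, *)."""
    counts = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return {
        a: {b: n / sum(c.values()) for b, n in c.items()}
        for a, c in counts.items()
    }


def log_likelihood(model, text):
    """Sum of log P(next char | current char) over the text."""
    return sum(math.log(model[a][b]) for a, b in zip(text, text[1:]))


def sample(model, start, length, rng):
    """Draw characters one at a time from the fitted conditionals."""
    out = [start]
    for _ in range(length - 1):
        dist = model.get(out[-1])
        if not dist:  # no observed successor for this character
            break
        symbols, probs = zip(*dist.items())
        out.append(rng.choices(symbols, weights=probs)[0])
    return "".join(out)


# Hypothetical toy corpus, chosen only for illustration.
corpus = "the cat sat on the mat and the cat ate the rat "
model = fit_bigram(corpus)
generated = sample(model, "t", 50, random.Random(0))
print(generated)
```

Any other choice of conditionals over the same observed transitions (uniform, say) gives the training text a log likelihood no higher than the normalized counts do, which is all "maximizing log likelihood" means at this scale. A large neural model is obviously far richer than this, so treat it as an analogy for the objective, not for the architecture.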
> which clearly explains it (and said emergent phenomena)
Very smart information theory people have looked at neural networks through the lens of information theory and published famous papers about it years ago. It couldn't explain many things about neural networks, but it was interesting nonetheless.
FWIW it's not uncommon for smart people to say "this mathematical structure looks like this other idea with [+/- some structure]!!" and claim that it totally explains everything... (kind of, with so-and-so exceptions, well, and also this and that and...). Truthfully, we just don't know. And I've never seen theorists in this field actually take the theory and produce something novel or make useful predictions with it. It's all "try stuff and see what works", and then retroactively make up some crud about why it worked, if it did work (otherwise brush it under the rug).
There was this one posted recently on transformers being kernel smoothers: https://arxiv.org/abs/1908.11775
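For anyone who hasn't seen the kernel-smoother framing, a minimal sketch of the basic identity as I understand it: single-query softmax attention gives the same output as a Nadaraya-Watson style weighted average with kernel exp(q·k / sqrt(d)). The function names and numbers below are my own illustration, not code from the paper.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax_attention(q, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(q))
    scores = [dot(q, k) / scale for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def kernel_smoother(q, keys, values, kernel):
    """Kernel-weighted average of the values, normalized by total weight."""
    weights = [kernel(q, k) for k in keys]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


def exp_kernel(q, k):
    """Unnormalized kernel that recovers softmax attention."""
    return math.exp(dot(q, k) / math.sqrt(len(q)))


# Made-up numbers, purely for illustration.
q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

attn_out = softmax_attention(q, keys, values)
smooth_out = kernel_smoother(q, keys, values, exp_kernel)
print(attn_out, smooth_out)
```

The two outputs agree up to floating-point error. Swapping `exp_kernel` for a different kernel gives a different smoother, which is roughly the design space that framing opens up; whether it predicts anything useful is a separate question.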