| Sorry I wasn't more clear! I'm referring to the Kolmogorov complexity of the network. The OP said: > With enough computation, your neural net weights would converge to some very compressed latent representation of the source code of DOOM. Maybe smaller even than the source code itself? Someone in the field could probably correct me on that. And they're not wrong! An ideally trained network could, in principle, learn the data-generating program, if that program is within its class of representable functions. I might have a NN that naively looks like it takes up GBs of space, but it might actually be parameterizing a much simpler function (hence our ability to prune/compress the weights without performance loss - most of the capacity wasn't being used for any interesting computation). You're right that there's no guarantee that the model finds the most "dense" representation. The goal of regularization is to encourage that, though! All over the place in ML there are bounds like: test loss <= train loss + model complexity Hence minimizing model complexity improves generalization performance. This is a kind of Occam's Razor: the simplest model generalizes best. So the OP is on the right track - we definitely want networks to learn the "underlying" process that explains the data, which in this case would be a latent representation of the source code (well, except that doesn't really make sense since you'd need the whole rest of the compute stack that code runs on - the neural net has no external resources/embodied complexity it calls, unlike the source code which gets to rely on drivers, hardware, operating systems, etc.) |
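For reference, one textbook form of the "test loss <= train loss + model complexity" bound quoted above is the Occam (description-length) bound. This is only a sketch under assumptions that aren't stated in the thread: a loss bounded in [0, 1], m i.i.d. training samples, and a fixed prefix-free description language in which hypothesis h takes |h| bits. The symbols L_D (test loss), L_S (train loss), m and δ are my own notation.

```latex
% With probability at least 1 - \delta over the draw of the training set S,
% simultaneously for every hypothesis h in the (countable) class:
\[
  L_{D}(h) \;\le\; L_{S}(h) \;+\; \sqrt{\frac{|h|\,\ln 2 + \ln(2/\delta)}{2m}}
\]
% |h|: description length of h in bits (assumed prefix-free encoding)
% m:   number of i.i.d. training samples
```

Note that |h| here is the length under one fixed, computable encoding, so it only upper-bounds the Kolmogorov complexity (up to a constant). The bound also says nothing about whether training actually finds a short h.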
I suggested that this probably isn't the case here, since they tried distillation and saw no effect. While this isn't proof that this particular model can't be compressed further, it does suggest that compressing it is non-trivial. This is especially true given the huge difference in size. I mean, we're talking about 700x...
I think our disagreement is that I read the OP as talking about __this__ network. If we're talking about a theoretical network, well... nothing I said anywhere disagrees with that. I even said in the post I linked to that the difference shows there's still a long way to go, but that this is still cool. Why did I assume OP was talking about __this__ network? Because we're in a thread about a paper. And yes, these are compression machines, so in principle (well, not actually backed by any mathematical theory) the same claim holds for so many things that it's a bit elementary. So it makes more sense (imo) that we're talking about this network. And I wanted to make it clear that this network is nowhere near that level of compression. Can further research later result in something that is smaller than the source code? Who knows? For all the reasons we've both mentioned. We know they are universal approximators (which are not universal mimickers and have limits), but we have no guarantee of global convergence (let alone proof that such a thing exists in many problems).
And I'm not sure why you're trying to explain the basic concepts to me. I mentioned I was an ML researcher. I see you're a PhD at Oxford. I'm sure you would be annoyed if I was doing the same to you. We can talk at a different level.