|
|
|
|
|
by godelski
655 days ago
|
|
> An ideally trained network could, in principle, learn the data-generating program
No disagreement > I might have a NN that naively looks like it takes up GBs of space, but it might actually be parameterizing a much simpler function (hence our ability to prune/compress the weights without performance loss - most of the capacity wasn't being used for any interesting computation).
Also no disagreement.I suggested that this probably isn't the case here since they tried distillation and saw no effect. While this isn't proof that this particular model can't be compressed more it does suggest that it is non-trivial. This is especially true given the huge difference in size. I mean we're talking about 700x... Where I think our disagreement is in that I read the OP as saying __this__ network. If we're talking about a theoretical network, well... nothing I said anywhere is in any disagreement with that. I even said in the post I linked to that the difference shows that there's still a long way to go but that this is still cool. Why did I assume OP was talking about __this__ network? Well because we're in a thread talking about a paper and well... yes, we're talking about compression machines so theoretically (well not actually supported by any math theory) this is true for so many things and that is a bit elementary. So makes more sense (imo) that we're talking about this network. And I wanted to make it clear that this network is nowhere near compression. Can further research later result in something that is better than the source code? Who knows? For all the reasons we've both mentioned. We know they are universal approximators (which are not universal mimicers and have limits) but we have no guarantee of global convergence (let alone proof such a thing exists in many problems). And I'm not sure why you're trying to explain the basic concepts to me. I mentioned I was an ML researcher. I see you're a PhD at Oxford. I'm sure you would be annoyed if I was doing the same to you. We can talk at a different level. |
|
I agree with you that this network probably has not found the source code or something like a minimal description in its weights.
Honestly, I'm writing a paper on model compression/complexity right now, so I may have co-opted the discussion to practice talking about these things...! Just a bit over-eager (,,>﹏<,,)
Have you given much thought to how we can encourage models to be more compressible? I'd love to be able to explicitly penalize the filesize during training, but in some usefully learnable way. Proxies like weight norm penalties have problems in the limit.