Hacker News new | ask | show | jobs
by cocomutator 474 days ago
Right, but see how the complexity budget is prescribed ahead of time: we first set the regularization strength or whatever, and then optimize the model. The result is the best model with complexity no greater than the budget. In this standard approach, we're not minimizing complexity, we're constraining it.
1 comments

Again, because of duality, these are not really different things.

To your point in the other thread, once you start optimizing both data fidelity and complexity, it's no longer that different from other approaches. Regularization has been common in neural nets, but usually in a simple "sum of sizes of parameters" type way, and seemingly not an essential ingredient in recent successful models.

ah I see what you mean, sorry it took me a while. Yes, you're right, the two are dual: "fix complexity budget then optimize data error", and "fix data error budget then optimize cx".

I'm struggling to put a finger on it, but it feels that the approach in the blog post has the property that it finds the _minimum_ complexity solution, akin to driving the regularization strength in conventional ml higher and higher during training, and returning the solution at the highest such regularization that does not materially degrade the error (epislon in their paper). information theory plays the role of a measuring device that allows them to measure the error term and model complexity on a common scale, so as to trade them off against each other in training.

I haven't thought about it much but i've seen papers speculating that what happens in double-descent is finding lower complexity solutions.