Hacker News new | ask | show | jobs
by getnormality 476 days ago
The idea of minimizing complexity is less novel than it may seem. Regularization terms are commonly added to loss objectives in optimization, and these regularizers can often be interpreted as penalizing complexity. Duality allows us to interpret these objectives in multiple ways:

1. Minimize a weighted sum of data error and complexity.

2. Minimize the complexity, so long as the data error is kept below a limit.

3. Minimize the error on the data, so long as the complexity is kept below a limit.

It does seem like classical regularization of this kind has been out of fashion lately. I don't think it plays much of a role in most Transformer architectures. It would be interesting if it makes some sort of comeback.

Other than that, I think there are so many novel elements in this approach that it is hard to tell what is doing the work. Their neural architecture, for example, seems carefully hacked to maximize performance on ARC-AGI type tasks. It's hard to see how it generalizes beyond.

1 comments

Right, but see how the complexity budget is prescribed ahead of time: we first set the regularization strength or whatever, and then optimize the model. The result is the best model with complexity no greater than the budget. In this standard approach, we're not minimizing complexity, we're constraining it.
Again, because of duality, these are not really different things.

To your point in the other thread, once you start optimizing both data fidelity and complexity, it's no longer that different from other approaches. Regularization has been common in neural nets, but usually in a simple "sum of sizes of parameters" type way, and seemingly not an essential ingredient in recent successful models.

ah I see what you mean, sorry it took me a while. Yes, you're right, the two are dual: "fix complexity budget then optimize data error", and "fix data error budget then optimize cx".

I'm struggling to put a finger on it, but it feels that the approach in the blog post has the property that it finds the _minimum_ complexity solution, akin to driving the regularization strength in conventional ml higher and higher during training, and returning the solution at the highest such regularization that does not materially degrade the error (epislon in their paper). information theory plays the role of a measuring device that allows them to measure the error term and model complexity on a common scale, so as to trade them off against each other in training.

I haven't thought about it much but i've seen papers speculating that what happens in double-descent is finding lower complexity solutions.