Hacker News new | ask | show | jobs
by cocomutator 476 days ago
I'm trying to distill the essence of their approach, which imho is concealed behind inessential and particular details such as the choice of this or that compression scheme or prior distributions.

It seems like the central innovation is the construction of a "model" which can be optimized with gradient descent, and whose optimum is the "simplest" model that memorizes the input-output relationships. In their setup, "simplest" has the concrete meaning of "which can be efficiently compressed" but more generally it probably means something like "whose model complexity is lowest possible".

This is in stark contrast to what happens in standard ML: typically, we start by prescribing a complexity budget (e.g. by choosing the model architecture and all complexity parameters), and only then train on data to find a good solution that memorizes input-output relationship.

The new method is ML on its head: we optimize the model so that we reduce its complexity as much as possible while still memorizing the input-output pairs. That this is able to generalize from 2 training examples is truly remarkable and imho hints that this is absolutely the right way of "going about" generalization.

Information theory happened to be the angle from which the authors arrived at this construction, but I'm not sure that is the essential bit. Rather, the essential bit seems to be the realization that rather than finding the best model for a fixed pre-determined complexity budget, we can find models with minimal possible complexity.

2 comments

The idea of minimizing complexity is less novel than it may seem. Regularization terms are commonly added to loss objectives in optimization, and these regularizers can often be interpreted as penalizing complexity. Duality allows us to interpret these objectives in multiple ways:

1. Minimize a weighted sum of data error and complexity.

2. Minimize the complexity, so long as the data error is kept below a limit.

3. Minimize the error on the data, so long as the complexity is kept below a limit.

It does seem like classical regularization of this kind has been out of fashion lately. I don't think it plays much of a role in most Transformer architectures. It would be interesting if it makes some sort of comeback.

Other than that, I think there are so many novel elements in this approach that it is hard to tell what is doing the work. Their neural architecture, for example, seems carefully hacked to maximize performance on ARC-AGI type tasks. It's hard to see how it generalizes beyond.

Right, but see how the complexity budget is prescribed ahead of time: we first set the regularization strength or whatever, and then optimize the model. The result is the best model with complexity no greater than the budget. In this standard approach, we're not minimizing complexity, we're constraining it.
Again, because of duality, these are not really different things.

To your point in the other thread, once you start optimizing both data fidelity and complexity, it's no longer that different from other approaches. Regularization has been common in neural nets, but usually in a simple "sum of sizes of parameters" type way, and seemingly not an essential ingredient in recent successful models.

ah I see what you mean, sorry it took me a while. Yes, you're right, the two are dual: "fix complexity budget then optimize data error", and "fix data error budget then optimize cx".

I'm struggling to put a finger on it, but it feels that the approach in the blog post has the property that it finds the _minimum_ complexity solution, akin to driving the regularization strength in conventional ml higher and higher during training, and returning the solution at the highest such regularization that does not materially degrade the error (epislon in their paper). information theory plays the role of a measuring device that allows them to measure the error term and model complexity on a common scale, so as to trade them off against each other in training.

I haven't thought about it much but i've seen papers speculating that what happens in double-descent is finding lower complexity solutions.

I think you're right about the essential ingredient in this finding, but I feel like this is a pretty ARC-AGI specific result.

Each puzzle is kind of a similar format, and the data that changes in the puzzle is almost precisely that needed to deduce the rule. By reducing the amount of information needed to describe the rule, you almost have to reduce your codec to what the rule itself is doing - to minimise the information loss.

I feel like if there was more noise or arbitrary data in each puzzle, this technique would not work. Clearly there's a point at which that gets difficult - the puzzle should not be "working out where the puzzle is" - but this only works because each example is just pure information with respect to the puzzle itself.

I agree with your observation about the exact noise-free nature of the problem. It allows them to formulate the problem as "minimize complexity such that you memorize the X-y relationship exactly". This would need to be generalized to the noisy case: instead of demanding exact memorization, you'd need to prescribe an error budget. But then this error budget seems like an effective complexity metaparameter, doesn't it, and we're back to square zero of cross-validation.
If we think of the 'budget' as being similar to a bandwidth limit on video playback, there's a kind of line below which the picture starts being pretty unintelligible, but for the most part that's a slider: the less the budget, the slightly less accurate playback you get.

But because this is clean data, I wonder if there's basically a big gap here: the codec that encodes the "correct rule" can achieve a step-change lower bandwidth requirement than similar-looking solutions. The most elegant ruleset - at least in this set of puzzles - always compresses markedly better. And so you can kind of brute-force the correct rule by trying lots of encoding strategies, and just identify which one gets you that step-change compression benefit.