Hacker News
by tbalsam 552 days ago
> Now that the AI research field is coming around to the idea that something beyond deep learning is needed,

I have not heard this from anyone that I work with! It would be a curious violation of info theory were this to be the case.

Certainly, some things cannot efficiently be learned from data. This is a case where some other kind of inductive bias or prior is needed (again, from info theory) -- but replacing deep learning entirely would be rather silly.

Part of the reason that a number of researchers don't take the benchmark more seriously is that it's designed to cripple the results. For example, in the name of reducing brute-force search, compute was severely limited! This turned many off to begin with. The general contention, as I understand it, was that a reasonable amount of compute should be allowed, but that would not play well with the numbers game: if you restrict compute beyond a reasonable point, the numbers look artificially low to people who don't know what's going on behind the scenes. And this ends up biasing the results unreasonably in favor of the original messaging (i.e., "We need something other than deep learning").

If it was structured with a reasonable amount of compute, and instead, time-accuracy gates were used for prizes, it would be much more open. But people do not use it because the game is rigged to begin with!

Unfortunately, that, plus the benchmark's consistent goalpost-moving, is why it generally doesn't have staying power in the research community -- the messaging changes based on what is convenient for publicity, and there's unfortunately been a history of similar things in the pedigree leading up to the ARC prize itself.

It is not entirely unsalvageable, but there really needs to be a turnaround in how the competition and prize are managed in order to win back people's trust. Placing a thumb on the scales to confirm a prior bias/previous messaging may work for a little while, but over time it robs the metric of its usability as the greater research community loses trust.

2 comments

I think you’re overly fixated on some minor points relative to the overall utility on offer here, and also skewing the facts a bit. For example, at one point you quote the OP on words that were never said, as far as I can see. At another point, you characterize their position as “replacing deep learning entirely”, which, as far as I can tell, has never been advocated for in this comment thread or on behalf of ARC.

That is an understandable statement, and probably a fair one as well, I feel.

Much of this comes in reference to statements from fchollet w.r.t. replacing deep learning -- around the time of the initial prize, when the marketing was much more hype-driven, this was essentially the throughline that was used, and it left a bitter taste in a number of people's mouths. W.r.t. misquoting, they did say that we needed something "beyond" deep learning, not "other than" it, and that is on me.

The utility is certainly still present, if, I feel, diminished, and my reaction is probably a case of my own frustrations over previous similar issues leading up to the ARC prize.

That being said, I do agree in retrospect that my response skewed from being objective -- it is a benchmark with a mixed history, but that doesn't mean that I should get personally caught up in it.

>> If it was structured with a reasonable amount of compute, and instead, time-accuracy gates were used for prizes, it would be much more open. But people do not use it because the game is rigged to begin with!

The entire benchmark is set up to make it _artificially_ hard for deep learning: there are only three examples for each task, AND, from what I can tell, the private test set has a different distribution from the public training and validation sets. That is a violation of PAC-learning assumptions, so why should anyone be surprised if machine learning approaches in general can't deal with it?
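
To make the PAC-learning point concrete, here is a minimal toy sketch (not ARC itself, and not anything from the benchmark's code): a one-dimensional threshold classifier is fitted on data from one distribution and then scored on a test set from the same distribution and on one whose inputs have been shifted. The data generator, the shift of 2.0, and all function names are assumptions made up for illustration.

```python
import random


def sample(n, shift, rng):
    """Draw n labelled points: class 0 centred at shift, class 1 at 2 + shift."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = rng.gauss(2.0 * label + shift, 0.5)
        data.append((x, label))
    return data


def fit_threshold(train):
    """Learn a decision threshold: the midpoint between the two class means."""
    means = []
    for c in (0, 1):
        xs = [x for x, y in train if y == c]
        means.append(sum(xs) / len(xs))
    return (means[0] + means[1]) / 2.0


def accuracy(threshold, data):
    """Fraction of points where 'x above threshold' matches the label."""
    correct = sum((x > threshold) == bool(y) for x, y in data)
    return correct / len(data)


rng = random.Random(0)  # fixed seed so the run is repeatable
threshold = fit_threshold(sample(1000, 0.0, rng))

acc_same = accuracy(threshold, sample(1000, 0.0, rng))     # same distribution
acc_shifted = accuracy(threshold, sample(1000, 2.0, rng))  # shifted test set

print(f"same distribution: {acc_same:.2f}, shifted: {acc_shifted:.2f}")
```

The learner is unchanged between the two evaluations; only the test distribution moves, and the guarantee that training performance carries over goes with it.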

Even I (long story) find ARC to be unfair in the simplest sense of the word: it does not make for a level playing field that would allow disparate approaches to machine learning to be compared fairly. Strangely and uniquely, the unfairness is aimed at the dominant approach, deep learning, whereas every other benchmark tends to skew towards deep learning (e.g. huge, feature-based, labelled datasets).

But why's that? If ARC-AGI is a true test of AGI, or intelligence, or whatever it is supposed to be (an IQ test for AIs) then why does it have to jump through hoops just to defend itself from the dominant approach to AI? If it's a good test for AI, and the dominant approach to AI can't really do AI, then the dominant approach should not be capable of passing the test, without any shenanigans with reduced compute or few examples.

Is the purpose to demonstrate that deep neural nets can't generalise from few examples? That's machine learning 101 (although I guess there are still those who missed the lecture). Is it to encourage deep neural nets to get better at generalising from few examples? Well, first place just went to a big, deep, bad neural net with data augmentation, so that doesn't even work.
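
For readers unfamiliar with what "data augmentation" can mean for grid puzzles, here is a minimal sketch that assumes the augmentations are rotations and reflections. The thread does not say which augmentations the winning entry actually used, so this choice, like the function names, is hypothetical.

```python
def rotate90(grid):
    """Rotate a grid (a list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def flip_horizontal(grid):
    """Mirror a grid left to right."""
    return [row[::-1] for row in grid]


def augment(grid):
    """Return the distinct rotations and reflections of a grid (at most 8)."""
    variants = []
    current = [list(row) for row in grid]
    for _ in range(4):
        for candidate in (current, flip_horizontal(current)):
            if candidate not in variants:
                variants.append(candidate)
        current = rotate90(current)
    return variants


grid = [[1, 2],
        [3, 4]]
for variant in augment(grid):
    print(variant)
```

In a real input/output task, the same transform would have to be applied to both grids of a pair so that the underlying rule is preserved. Either way, this turns three examples into a couple of dozen, which is exactly the kind of workaround the few-examples restriction was supposedly meant to rule out.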