
by mikeknoop 566 days ago

Author here -- six months ago we launched ARC Prize, a huge $1M experiment, to test if we need new ideas for AGI. The ARC-AGI benchmark remains unbeaten and I think we can now definitely say "yes".

One big update since June is that progress is no longer stalled. Coming into 2024, the public consensus vibe was that pure deep learning / LLMs would continue scaling to AGI. The fundamental architecture of these systems hasn't changed since ~2019.

But this flipped late summer. AlphaProof and o1 are evidence of this new reality. All frontier AI systems are now incorporating components beyond pure deep learning like program synthesis and program search.

I believe ARC Prize played a role here too. All the winners this year are leveraging new AGI reasoning approaches like deep-learning guided program synthesis, and test-time training/fine-tuning. We'll be seeing a lot more of these in frontier AI systems in coming years.

And I'm proud to say that all the code and papers from this year's winners are now open source!

We're going to keep running this thing annually until it's defeated. And we've got ARC-AGI-2 in the works to improve on several of the v1 flaws (more here: https://arcprize.org/blog/arc-prize-2024-winners-technical-r...)

The ARC-AGI community keeps surprising me. From initial launch, through o1 testing, to the final 48 hours when the winning team jumped 10% and both winning papers dropped out of nowhere. I'm incredibly grateful to everyone and we will do our best to steward this attention towards AGI.

We'll be back in 2025!


tbalsam 566 days ago

As a rather experienced ML researcher, ARC is a great benchmark on its own, but is punching below its weight in terms of claiming that it is a gate (or in terms of this post -- a "steward") towards AGI, and in my perspective and the perspective of several researchers near me this has watered down the value of the ARC benchmark as a test.

It is a great unit test for reasoning -- that's fantastic! And maybe it is indeed the best way to test for this -- who knows exactly. But the claim is a little grandiose for what it is. This is somewhat similar to saying that testing on string parity is the One True Test for testing an optimizer's efficiency.

I'd heartily recommend taking the marketing vibrance down a notch and keeping things a bit more measured. It's not entirely a meme, though some of the more-serious researchers don't take it as seriously as a result. And that's the kind of people that you want to attract to this sort of thing!

I think there is a potentially good future for ARC! But it might struggle to attract some of the kind of talent that you want to work on this problem as a result.

mikeknoop 566 days ago

> I'd heartily recommend taking the marketing vibrance down a notch and keeping things a bit more measured. It's not entirely a meme, though some of the more-serious researchers don't take it as seriously as a result.

This is a fair critique. ARC Prize's 2024 messaging was sharp to break through the noise floor -- ARC has been around since 2019 but most only learned about it this summer. Now that ARC has garnered awareness, that sharpness is no longer useful, and in some cases it is hurting progress, as you point out. The messaging needs to evolve and mature next year to be more neutral/academic.

tbalsam 566 days ago

I feel rather consternated that this response effectively boils down to "yes, we know we overhyped this to get people's attention, and now that we have it we can be more honest about it". Fighting for place in the attention economy is understandable, being deceptive about it is not.

This is part of the ethical morass of why some more serious researchers aren't touching the benchmark. People are not going to take it seriously if it continues like this!

mikeknoop 566 days ago

I think we agree; to clarify, sharp messaging isn't inaccurate messaging. And I believe the story is not overhyped given the evidence: the benchmark resisted a $1M prize pool for ~6 months. But I concede we did obsess about the story to give it the best chance of survival in the marketplace of ideas against the incumbent AI research meme (LLM scaling). Now that the AI research field is coming around to the idea that something beyond deep learning is needed, the story matters less, and the benchmark, and future versions, can stand on their utility as a compass towards AGI.

mrandish 566 days ago

Mike - please know that not everyone who appreciates ARC feels the same way as the GP. I'm not an academic researcher but I am quite sensitive to hype and excessive marketing. I've never felt the ARC site was anything other than appropriately professional.

Even revisiting it now, I don't see anything wrong with being concisely clear and even a little provocative in stating your case on your own site. Especially since a key value of ARC is getting more objectively grounded regarding progress toward AGI. On top of that ARC is "A non-profit for the public advancement of open artificial general intelligence" that you guys are personally donating serious money and time to that's helping a field where a lot of entrepreneurs are going to make money and academics are going to advance their careers.

My perception is ARC tried it the other way for years but a lot of academics and AI pundits ignored or dismissed it without ever meaningfully engaging with it. "Sharpening" the message this year has clearly paid off in bringing attention that's shifted the conversation and is helping advance progress toward AGI in ways nothing else has. I also greatly appreciate the time and care you and Francois have put into making the ARC proposition clear enough for non-technical people to understand. That's hard to do and doesn't happen by accident.

Personally, I've found ARC valuable in the real world outside of academia and domain experts because it provides a conceptually simple starting place to discuss with non-technical people what the term AGI might even mean. My high school-aged daughter asked me about vague AGI impending doom scenarios she heard on TikTok. I had her solve a couple ARC samples and then pointed out that today's best AIs aren't yet close to doing the same. This counter-intuitive revelation got her pondering the "Why?" which led to a deep discussion about the multi-dimensional breadth of human creativity and an appreciation of the many ways artificial intelligences might differ from human intelligence.

YeGoblynQueenne 566 days ago

>> My perception is ARC tried it the other way for years but a lot of academics and AI pundits ignored or dismissed it without ever meaningfully engaging with it.

Your perception is very wrong and the likely reason is that, as you say, you're not an academic researcher. ARC made a huge splash with the original Kaggle competition a few years ago and it drew in exactly the kind of "academic researcher" you seem to be pointing to: those in university research groups who do not have access to the data and compute that the big tech companies have, and who can consequently not compete in the usual big data benchmarks that are dominated by Google, OpenAI, Meta, and friends. ARC, with its (unfair) few-shot tasks and constantly changing private test set, is exactly the kind of dataset that kind of researcher is looking for, something that is relatively safe from big tech deep neural nets. Even the $1 million prize seems specially designed to be just enough to draw in that crowd of not super-rich academics while leaving corporate research groups insufficiently motivated.

Besides which, I won't name names but one of the principal researchers in the winning system is just one of those academics. I don't know which is the period you mean ARC was ignored by the academic community but that particular researcher was in a certain meeting of like-minded academics two years ago where one of the main areas of discussion was in short "how to beat ARC and show that our stuff works".

YeGoblynQueenne 566 days ago

>> Now that the AI research field is coming around to the idea that something beyond deep learning is needed, the story matters less, and the benchmark, and future versions, can stand on their utility as a compass towards AGI.

How so? All the three top systems are deep neural net systems. The first place went to a system that, quoting from the "contributions" section of the paper, employed:

>> An automated data generation methodology that starts with 100-160 program solutions for ARC training tasks, and expands them to make 400k new problems paired with Python solutions
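To make the quoted idea concrete, here is a minimal sketch of expanding a handful of seed programs into many problem/solution pairs. It is an illustration under loud assumptions, not the winning team's pipeline: the three seed transforms, the random composition step, and every name (`SEED_PROGRAMS`, `make_problem`, `generate_dataset`) are hypothetical, and simple random composition stands in for whatever remixing the paper actually uses.

```python
import random

# Hypothetical seed "program solutions": each maps a grid (list of rows) to a new grid.
def flip_horizontal(grid):
    return [row[::-1] for row in grid]

def flip_vertical(grid):
    return [row[:] for row in grid[::-1]]

def transpose(grid):
    return [list(col) for col in zip(*grid)]

SEED_PROGRAMS = {
    "flip_horizontal": flip_horizontal,
    "flip_vertical": flip_vertical,
    "transpose": transpose,
}

def random_grid(rng, height, width, n_colors=10):
    """Sample a random grid of integer cell values in [0, n_colors)."""
    return [[rng.randrange(n_colors) for _ in range(width)] for _ in range(height)]

def make_problem(rng, n_examples=3):
    """Compose two seed programs into a new one and render input/output pairs for it."""
    names = rng.sample(sorted(SEED_PROGRAMS), k=2)

    def solution(grid):
        for name in names:
            grid = SEED_PROGRAMS[name](grid)
        return grid

    pairs = []
    for _ in range(n_examples):
        grid = random_grid(rng, rng.randint(2, 5), rng.randint(2, 5))
        pairs.append({"input": grid, "output": solution(grid)})
    # Each problem is stored together with the program that solves it.
    return {"program": names, "pairs": pairs}

def generate_dataset(n_problems, seed=0):
    """Expand the small seed set into n_problems synthetic problems."""
    rng = random.Random(seed)
    return [make_problem(rng) for _ in range(n_problems)]

dataset = generate_dataset(1000)
```

Even this toy turns three seeds into an arbitrarily large training set, which is the property the comment is pointing at.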

As I pointed out in another comment the top results in ARC have been achieved by ordinary, deep-learning, big-data, memorisation based approaches. You and fchollet (in these comments) try to claim otherwise but I don't understand why.

In fact, no, I understand why. I think fchollet wanted to place ARC as "not just a benchmark", the opposite of what tbalsam is asking for above. The motivation is solid: if we've learned anything in the last twenty to thirty years, it is that deep neural nets are very capable at beating benchmarks. For any deep neural net model that beats a benchmark though the question remains whether it can do anything else besides. Unfortunately, that is not a question that can be answered by beating yet another benchmark.

And here we are now, and the first place in the current ARC challenge goes to a deep neural net system trained on a synthetically augmented dataset. The right thing to do now would be to scale back the claims about the magickal AGI-IQ test with unicorns, and accept that your benchmark is just not any different than any other previous AI benchmark, that it is not any more informative than any other benchmark, and that a completely different kind of test of artificial intelligence is needed.

There is after all such a thing as scientific integrity. You make a big conjecture, you look at the data, realise that you're wrong, accept it, and move on. For example, the authors of GLUE did that (as in SuperGLUE). The authors of the Winograd Schema Challenge did that. You should follow their examples.

trott 565 days ago

> realise that you're wrong, accept it, and move on

What do you think about limiting the submission size? Kaggle does this sometimes.

With a limit like 0.1-1MB (compressed), you are basically saying: "Give me sample-efficient learning algorithms, not pretrained models."
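As a rough illustration of how such a cap could be checked, here is a minimal sketch. Everything specific in it is an assumption: zlib as the compressor, 1 MB taken as 1,000,000 bytes, and the helper names `compressed_size_bytes` and `within_limit` are all hypothetical, not anything Kaggle or ARC Prize is known to use.

```python
import os
import zlib

def compressed_size_bytes(payload: bytes, level: int = 9) -> int:
    """Return the zlib-compressed size of a submission payload, in bytes."""
    return len(zlib.compress(payload, level))

def within_limit(payload: bytes, limit_bytes: int = 1_000_000) -> bool:
    """True if the compressed payload fits under the (hypothetical) size cap."""
    return compressed_size_bytes(payload) <= limit_bytes

# Source code for a learning algorithm is small and compresses well.
solver_source = b"def solve(grid):\n    return [row[::-1] for row in grid]\n" * 100

# Random bytes stand in for pretrained weights here: they are essentially incompressible.
fake_weights = os.urandom(5_000_000)

print(within_limit(solver_source), within_limit(fake_weights))
```

The point of the design is that an algorithm fits under the cap while a large set of learned parameters does not, whatever the exact threshold turns out to be.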

tbalsam 566 days ago

> Now that the AI research field is coming around to the idea that something beyond deep learning is needed,

I have not heard this from anyone that I work with! It would be a curious violation of info theory were this to be the case.

Certainly, some things cannot efficiently be learned from data. This is a case where some other kind of inductive bias or prior is needed (again, from info theory) -- but replacing deep learning entirely would be rather silly.

Part of the reason that a number of researchers don't take the benchmark more seriously is that it's meant to cripple the results. For example, in the name of reducing brute force search, the compute was severely limited! This turned many off to begin with. The general contention, as I understand it, was to let compute be a reasonable amount, but this would not play well with the numbers game. Because if you restrict compute beyond a reasonable point, it makes the numbers artificially low for people who don't know what's going on behind the scenes. And this ends up biasing the results unreasonably to favor the original messaging (i.e., "We need something other than deep learning.")

If it was structured with a reasonable amount of compute, and instead, time-accuracy gates were used for prizes, it would be much more open. But people do not use it because the game is rigged to begin with!

Unfortunately, that, plus the consistent goal-post moving of the benchmark, is why it generally doesn't have staying power in the research community -- the messaging changes based upon what is convenient for publicity, and there's unfortunately been a history of similar things in the pedigree leading up to the ARC Prize itself.

It is not entirely unsalvageable, but there really needs to be a turnaround in how the competition and prize are managed in order to win back people's trust. Placing a thumb on the scales to confirm a prior bias/previous messaging may work for a little while, but over time it robs the metric of its usability as the greater research community loses trust.

WhitneyLand 566 days ago

I think you’re overly fixated on some minor points relative to the overall utility on offer here. And also skewing the facts a bit. For example at one point you quote the OP on words that were never said as far as I can see. At another point, you characterize their position as “replacing deep learning entirely” which, as far as I can tell, has never been advocated for in this comment thread or on behalf of ARC.

YeGoblynQueenne 566 days ago

>> If it was structured with a reasonable amount of compute, and instead, time-accuracy gates were used for prizes, it would be much more open. But people do not use it because the game is rigged to begin with!

The entire benchmark is set up so as to try and make it _artificially_ hard for deep learning: there are only three examples for each task; AND the private test set has a different distribution than the public training and validation sets (from what I can tell; a violation of PAC-Learning assumptions and then why should anyone be surprised if machine learning approaches in general can't deal with that?).

Even I (long story) find ARC to be unfair in the simplest sense of the word: it does not make for a level playing field that would allow for disparate approaches to machine learning to be compared fairly. Strangely and uniquely, the unfairness is aimed at the dominant approach, deep learning, where every other benchmark tends to skew towards deep learning (e.g. huge feature-based, labelled data).

But why's that? If ARC-AGI is a true test of AGI, or intelligence, or whatever it is supposed to be (an IQ test for AIs) then why does it have to jump through hoops just to defend itself from the dominant approach to AI? If it's a good test for AI, and the dominant approach to AI can't really do AI, then the dominant approach should not be capable of passing the test, without any shenanigans with reduced compute or few examples.

Is the purpose to demonstrate that deep neural nets can't generalise from few examples? That's machine learning 101 (although I guess there's still those who missed the lecture). Is it to encourage deep neural nets to get better at generalising from few examples? Well, first place just went to a big, deep, bad neural net with data augmentation so that doesn't even work.

iwsk 566 days ago

we live in a society

padswo1 566 days ago

I don’t think ARC has particularly advanced the research. The approaches that are successful were developed elsewhere and then applied to ARC. Happy to be shown somewhere this is not the case.

In the case of TTT, I wouldn’t really describe that as a ‘new AGI reasoning approach’. People have been fine tuning deep learning models on specific tasks for a long time.

The fundamental instinct driving the creation of ARC - that ‘deep learning cannot do system 2 thinking’, is under threat of being proven wrong very soon. Attempts to define the approaches that are working as somehow not ‘traditional deep learning’ really seem like shifting the goal posts.

mikeknoop 566 days ago

Correct, fine-tuning is not new. It's long been used to augment foundational LLMs with private data, e.g. private enterprise data. We do this at Zapier, for instance.

The new and surprising thing about test-time training (TTT) is how effective it is as an approach for dealing with novel abstract reasoning problems like ARC-AGI.

TTT was pioneered by Jack Cole last year and popularized this year by several teams, including this winning paper: https://ekinakyurek.github.io/papers/ttt.pdf

p1esk 566 days ago

How is TTT anything other than a deep learning algorithm? We have a deep learning model, we generate training data based on an example and use stochastic gradient descent to update the model weights to improve its predictions according to the training data. This is a classic DL paradigm. I just don't see why you would consider this an advancement if your goal is to move "beyond" deep learning.

mrandish 566 days ago

Congrats to you and Francois on the success of ARC-AGI 24 and thanks so much for doing it. I just finished the technical report and am encouraged! It's great to finally see some tangible progress in research that is both novel and plausibly in fruitful directions.

trott 566 days ago

Mike and François,

Compute is limited during inference, and this naturally limits brute-force program search.

But this doesn't prevent one from creating a huge ARC-like dataset ahead of time, like BARC did (but bigger), and training a correspondingly huge NN on it.

Placing a limit on the submission size could foil this kind of brute-force approach though. I wonder if you are considering this for 2025?

nerdponx 565 days ago

> Coming into 2024, the public consensus vibe was that pure deep learning / LLMs would continue scaling to AGI.

Was it? What did the "public" consist of exactly?