| (ARC Prize co-founder here). Ryan's work is legitimately interesting and novel "LLM reasoning" research! The core idea: > get GPT-4o to generate around 8,000 python programs which attempt to implement the transformation, select a program which is right on all the examples (usually there are 3 examples), and then submit the output this function produces when applied to the additional test input(s) Roughly, he's implemented an outer loop and using 4o to sample reasoning traces/programs from training data and test. Hybrid DL + program synthesis approaches are solutions we'd love to see more of. A couple important notes: 1. this result is on the public eval set vs private set (ARC Prize $). 2. the current private set SOTA ~35% solution also performed ~50% on the public set. so this new result might be SOTA but hasn't been validated or scrutinized yet. All said, I do expect verified public set results to flow down to the private set over time. We'll be publishing all the SOTA scores and open source reproductions here once available: https://arcprize.org/leaderboard EDIT: also, congrats and kudos to Ryan for achieving this and putting the effort in to document and share his approach. we hope to inspire more frontier AI research sharing like this |
> In addition to iterating on the training set, I also did a small amount of iteration on a 100 problem subset of the public test set
and
> it's unfortunate that these sets aren’t IID: it makes iteration harder and more confusing
It’s not unfortunate: generalizing beyond the training distribution is a crucial part of intelligence that ARC is trying to measure! Among other reasons, developing with test-set data is a bad practice in ML because it hides the difficulty this challenge. Even worse, writing about a bunch of tricks that help results on this subset is extending the test-set leakage the blog post's readers. This is why I'm glad the ARC Prize has a truly hidden test set