Hacker News
by adamviola 530 days ago
Most benchmarks contain many examples of a single task - e.g., each example in an image classification benchmark is an image and its associated class. Historically, the approach for doing well on these benchmarks has been to (1) train a large model on (2) lots of data. However, each item in the ARC benchmark is a totally unique task. The model is shown a handful of examples (questions and answers) of that task and is asked to complete one new instance of it. Importantly, the evaluation tasks are secret. The only way models can “prepare” for ARC is by getting familiar with the publicly known priors of ARC tasks - e.g., the colored grid world. As a result, ARC evaluates the ability of models to learn new tasks from limited data at test time. This is something humans do very well and models do not (at least up until now).
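To make the "handful of examples, then one instance" setup concrete, here is a rough Python sketch. The task below is made up for illustration and is not a real ARC task. The "train"/"test" layout with "input"/"output" grids follows how I understand the public ARC files to be structured, so treat that as an assumption. The solver is a deliberately naive toy that only handles tasks where each color is consistently swapped for another; real ARC tasks need far more than this.

```python
from typing import Dict, List, Optional

Grid = List[List[int]]  # each cell holds a color index

# Hypothetical toy task (not from ARC): every 1 becomes 2, other colors stay.
task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[0, 2], [2, 0]]},
        {"input": [[1, 1], [0, 0]], "output": [[2, 2], [0, 0]]},
    ],
    "test": [{"input": [[1, 0], [0, 1]]}],
}


def infer_color_map(pairs: List[dict]) -> Optional[Dict[int, int]]:
    """Infer a cell-wise color substitution from the demonstration pairs.

    Returns None if the grids differ in shape or the demonstrations
    contradict each other, i.e. the task is not a pure color swap.
    """
    mapping: Dict[int, int] = {}
    for pair in pairs:
        grid_in, grid_out = pair["input"], pair["output"]
        if len(grid_in) != len(grid_out):
            return None
        for row_in, row_out in zip(grid_in, grid_out):
            if len(row_in) != len(row_out):
                return None
            for color_in, color_out in zip(row_in, row_out):
                # setdefault records the first mapping seen for a color;
                # a later mismatch means the hypothesis is wrong.
                if mapping.setdefault(color_in, color_out) != color_out:
                    return None
    return mapping


def solve(task: dict) -> Optional[Grid]:
    """Learn from the few demonstrations, then answer the single test input."""
    mapping = infer_color_map(task["train"])
    if mapping is None:
        return None
    test_input: Grid = task["test"][0]["input"]
    # Colors never seen in the demonstrations are left unchanged.
    return [[mapping.get(color, color) for color in row] for row in test_input]


prediction = solve(task)
```

The point of the sketch is that the rule has to be worked out from two demonstrations at test time. Nothing about this particular task was available beforehand, only the general grid-of-colors format.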