APPS has 3 subsets by difficulty level: introductory, interview, and competition. It isn't clear which subset Claude 3 was benchmarked on. Even if it is just "introductory", that is still a pretty good result, but it would be good to know.
Since they don’t state it, does it mean they tested it on the whole test set?
If that’s the case, and we assume for simplicity that Opus solves all Intro problems and none of the Competition problems, it’d have solved 83%+ of the Interview level problems.
(The test set has 1000/3000/1000 problems at the introductory/interview/competition levels, respectively.)
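For anyone who wants to check the arithmetic, here is a rough sketch. The overall score of 70.2% is my assumption for the reported Opus figure (it isn't stated in this thread), and the function name is made up for illustration. The "all intro, no competition" rates are just the simplifying assumption from above.

```python
# Test-set sizes per difficulty level, as given in the comment above.
SIZES = {"introductory": 1000, "interview": 3000, "competition": 1000}


def implied_interview_rate(overall_score, intro_rate=1.0, competition_rate=0.0):
    """Back out the interview-level solve rate from an overall score.

    overall_score is the fraction solved across the whole test set.
    intro_rate and competition_rate are assumed solve rates for the
    other two levels (defaults: all intro solved, no competition solved).
    """
    total = sum(SIZES.values())
    solved = overall_score * total
    solved_interview = (
        solved
        - intro_rate * SIZES["introductory"]
        - competition_rate * SIZES["competition"]
    )
    return solved_interview / SIZES["interview"]


# ASSUMPTION: 0.702 is taken as the reported overall APPS score for Opus.
rate = implied_interview_rate(0.702)
print(f"{rate:.1%}")
```

Changing `intro_rate` or `competition_rate` shows how sensitive the estimate is: any competition problems solved, or any intro problems missed, shift the implied interview rate down or up accordingly.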
It’d be great if someone from Anthropic could confirm, though.