| "But I suspect that the ProgramBench authors are either under-eliciting the AIs, or their tasks are unfair/impossible given the constraints, or both." I'd go with "impossible": "Given a gold (reference) executable and its usage documentation, a task worker is asked to write source code and a build script that constructs a candidate executable which should reproduce the behavior of the gold executable." The test cases are produced by an AI examining the source code, and later text also confirms that the AI in the production phase can't read the original executable, so it can't reverse engineer it directly. The test cases are therefore drawn from a situation where the tester has vastly more knowledge of the program than the implementer. That is a losing scenario for anyone, be they human, modern AI, or even some hypothetical perfect programmer. Take ffmpeg as an extreme example. The documentation does not even remotely specify the program. Entire codecs can be missed at a stroke, and each of those codecs is itself a rich set of features that may or may not be used in a given input or output file, yet the final tests can freely draw from any of those things. Trying to implement a codec from just some sample input and output would strain anyone, especially when the samples are all but certain not to be broad enough to determine the behavior for sure. That sort of issue extends all the way down to even some tiny command-line programs I've written myself. The end-user documentation is never a specification. That's not what end-user documentation is. And even if you did hand the AI all the relevant specifications, you'd only get an implementation of the specification, and anyone who has ever implemented a non-trivial specification for real-world use can tell you that even the spec is never enough. I think that's an absolutely ridiculous test. 
If you handed it to me as a human I would simply refuse, because I'd tell you straight up front that it is plainly obvious I'm going to utterly and completely fail, so why even bother spending the time to try? |