Nice work once again from Ofir Press and team; this seems to be an idea that's in the air.

> Our 200 tasks range from compact CLI tools to widely used software such as FFmpeg, SQLite, and the PHP interpreter. We evaluate 9 LMs and find that none fully resolve any task

FWIW, this is very different from what we find in MirrorCode:

> Opus 4.6 successfully reimplements almost every program up to gotree’s size in our benchmark.

https://epoch.ai/blog/mirrorcode-preliminary-results

I don't have time right now to dig into what could explain the difference (I'm working hard on getting the full MirrorCode out as soon as possible). But I suspect that the ProgramBench authors are either under-eliciting the AIs, or their tasks are unfair/impossible given the constraints, or both. I hope to look more into it after releasing MirrorCode, and write up my conclusions.
E.g., cal is totally routine. I would expect most sophomores to be able to write a perfectly good cal. In fact, the only program you tested that has anywhere close to the complexity of SQLite or FFmpeg is Pkl, and it looks like Opus 4.6 totally failed.
I think your results are consistent. You're just measuring different things. Your benchmark mostly tests LLMs' ability to write technically routine programs of moderate length. Yes, the bioinformatics package involves specialized domain knowledge, but not specialized Go engineering. ProgramBench is harder.