by tadamcz 36 days ago
Nice work once again from Ofir Press and team; this seems to be an idea that's in the air.

> Our 200 tasks range from compact CLI tools to widely used software such as FFmpeg, SQLite, and the PHP interpreter. We evaluate 9 LMs and find that none fully resolve any task

Fwiw, this is very different from what we find in MirrorCode:

> Opus 4.6 successfully reimplements almost every program up to gotree’s size in our benchmark.

https://epoch.ai/blog/mirrorcode-preliminary-results

I don't have time right now to dig into what could explain the difference (I'm working hard on getting the full MirrorCode out as soon as possible). But I suspect that the ProgramBench authors are either under-eliciting the AIs, or their tasks are unfair/impossible given the constraints, or both.

I hope to look more into it after releasing MirrorCode, and write up my conclusions.

6 comments

Surely the biggest difference is that you guys are mostly testing LLMs on simpler utilities, mostly involving higher-level languages, whereas ProgramBench are all very complex C programs (and much older programs with much more comprehensive test cases).

Eg cal is totally routine. I would expect most sophomores to be able to write a perfectly good cal. In fact the only program you tested which actually has anywhere close to the complexity of SQLite or FFmpeg is Pkl, and it looks like Opus 4.6 totally failed.

I think your results are consistent. You're just measuring different things. Your benchmark mostly tests LLMs' ability to write technically routine programs of moderate length - yes the bioinformatics package involves specialized domain knowledge, but not specialized Go engineering. ProgramBench is harder.

I don't think so. ProgramBench authors say no LLMs fully resolve any task, i.e. even the easiest tasks in their benchmark are unsolved. Whereas we found Opus 4.6 successfully reimplements almost every program up to gotree’s size (around 15-20 of them).

For Pkl, the preliminary results only went up to 1bn total tokens (costing $550, which would be cheap if LLMs could do the task). It might very well be solved at higher token budgets; see the report for more discussion of this.

The preliminary results are just on 4 targets. We have several Pkl-level and harder tasks in the full set which we're releasing soon.

In the following quote multiple things are not quite right:

> mostly involving higher-level languages, whereas ProgramBench are all very complex C programs (and much older programs with much more comprehensive test cases).

First, as I said above I think you're confusing the top-end of ProgramBench difficulty with the average. The quote in the OP is pretty clear that FFmpeg, SQLite, and PHP are the 3 hardest out of 200 in ProgramBench, and the bottom end is "compact CLI tools".

Second, I don't see the relevance of C vs higher-level languages, how does this make ProgramBench harder?

Third, for the test cases, I think you might be labouring under a misapprehension about how MirrorCode works? MirrorCode uses end-to-end tests from a variety of sources (the original program’s test suites, real-world data, and LLM-assisted generation). End-to-end means the stdout/stderr has to match exactly for each test case.
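
To make the exact-match criterion concrete, here is a minimal sketch of such an end-to-end check in Python. It is not MirrorCode's actual harness: `run_case`, `case_passes`, and the stand-in `GOLD`/`GOOD`/`BAD` programs are hypothetical names made up for illustration. Whether the real benchmark also compares exit codes is not stated above, so this sketch compares only stdout and stderr.

```python
import subprocess
import sys
from typing import NamedTuple, Sequence


class Outcome(NamedTuple):
    stdout: bytes
    stderr: bytes


def run_case(program: Sequence[str], args: Sequence[str], stdin: bytes = b"") -> Outcome:
    """Run one test case and capture both output streams as raw bytes."""
    proc = subprocess.run(
        [*program, *args],
        input=stdin,
        capture_output=True,
        timeout=30,
        check=False,  # a non-zero exit is a result to compare, not an error
    )
    return Outcome(proc.stdout, proc.stderr)


def case_passes(gold: Sequence[str], candidate: Sequence[str],
                args: Sequence[str], stdin: bytes = b"") -> bool:
    """A case passes only if stdout and stderr are byte-for-byte identical."""
    return run_case(gold, args, stdin) == run_case(candidate, args, stdin)


# Hypothetical stand-ins for the reference program and two reimplementations.
GOLD = [sys.executable, "-c", "import sys; print(sys.argv[1].upper())"]
GOOD = [sys.executable, "-c", "import sys; print(sys.argv[1].upper())"]
BAD = [sys.executable, "-c", "import sys; print(sys.argv[1].upper() + ' ')"]

if __name__ == "__main__":
    print(case_passes(GOLD, GOOD, ["hello"]))
    print(case_passes(GOLD, BAD, ["hello"]))
```

Note that `BAD` differs from the reference only by one trailing space, and that is already enough to fail the case. Exact matching leaves no room for "close enough" output.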

> Eg cal is totally routine. I would expect most sophomores to be able to write a perfectly good cal.

This is incidental to the main disagreement, but btw I also doubt this.

Let's try and make the claim more precise. e.g. are you saying the average university undergraduate studying CS would reimplement cal from scratch (only stdlib), matching the output perfectly for all 1365 MirrorCode test cases, in (say) 3 days of full-time work (without AI assistance obviously)? I'd bet against it!

Here is the manual for the cal that we use: https://media.githubusercontent.com/media/epoch-research/Mir...

You can also look at a full transcript of an LLM solving the task: https://epochai-public-eval-logs-manual.s3.amazonaws.com/eva...

The data is here: https://github.com/epoch-research/MirrorCode-data/

I didn't say "3 days of full-time work," that is totally unreasonable. I was giving them basically unlimited time to do whatever slow testing and research they needed. And let me qualify my statement: when I say "I would expect most sophomores to be able to do this," I mean "if most sophomores can't do this then their university is badly failing them." (If you want to split hairs about modern undergrads not learning C then I think this conversation is over.)

Of course it would take them a while to learn facts about datetime that the LLM doesn't need to learn. If your argument is about cost optimization then congrats, you win. The point is that it doesn't take a huge amount of C expertise to do this successfully - the standard implementation is nothing you wouldn't see in K&R: https://raw.githubusercontent.com/util-linux/util-linux/refs... It's routine.
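
For a sense of what the core of such a program involves, here is a rough Python sketch of the month layout logic (the discussion is about C, but the arithmetic is the same). Everything in it is an assumption for illustration: `day_of_week` uses Sakamoto's method on the proleptic Gregorian calendar, and `render_month` is a hypothetical helper. It is not claimed to match the real cal's output byte for byte; the real program has many more options and, in util-linux, the 1752 calendar reform to handle.

```python
MONTH_NAMES = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def is_leap(year: int) -> bool:
    """Gregorian leap-year rule."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def days_in_month(year: int, month: int) -> int:
    if month == 2:
        return 29 if is_leap(year) else 28
    return 30 if month in (4, 6, 9, 11) else 31


def day_of_week(year: int, month: int, day: int) -> int:
    """Sakamoto's method: 0 = Sunday ... 6 = Saturday (Gregorian only)."""
    offsets = [0, 3, 2, 5, 0, 3, 5, 1, 4, 6, 2, 4]
    if month < 3:
        year -= 1
    return (year + year // 4 - year // 100 + year // 400
            + offsets[month - 1] + day) % 7


def render_month(year: int, month: int) -> list:
    """Return the lines of a Sunday-first month grid, 20 columns wide."""
    title = f"{MONTH_NAMES[month - 1]} {year}".center(20).rstrip()
    lines = [title, "Su Mo Tu We Th Fr Sa"]
    # Blank cells pad the first week up to the weekday of the 1st.
    cells = ["  "] * day_of_week(year, month, 1)
    cells += [f"{d:2d}" for d in range(1, days_in_month(year, month) + 1)]
    for i in range(0, len(cells), 7):
        lines.append(" ".join(cells[i:i + 7]).rstrip())
    return lines


if __name__ == "__main__":
    print("\n".join(render_month(2024, 1)))
```

The grid itself is a few dozen lines. The remaining work in a full reimplementation is matching every flag, locale detail, and edge case of the reference output exactly.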

But a nontrivial database, even a simple one like SQLite, really does require professional-level C expertise. It is not routine. So your comparison to ProgramBench still seems apples-to-oranges.

I think we're talking past each other here...

I would love to try this out. I have a horrible legacy project written in Angular by a really amateur developer, full of huge blocks of copy-pasted code with minor modifications in each block. I’ve tried before to get an LLM to rewrite it into something more sensible, but I have not succeeded; usually it just ends up breaking everything. Is there a guide or some system to follow? What’s the best way to accomplish a task like this?

I think one way is to put the existing system in something like a Docker container or equivalent, some kind of black box, and write tests against it using pure HTTP calls or browser automation to record its behavior (you can drive this with AI). When you've reached a truly massive test suite that covers everything, you delete the container and use the test suite as an oracle for writing a new version (open book: the AI can look at the test suite but not change it).

This is a tactic based on things I have read in "Working Effectively with Legacy Code" by Michael Feathers - he discusses using cut points to build a testing firewall to bring code under test, then gradually expanding the test suite from that beachhead of confirmed interface.
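
A minimal sketch of the record-then-replay idea, in Python. To keep it self-contained and runnable without a network, the "system" is modelled as a plain function taking `(method, path, body)` and returning `(status, body)`. In practice you would replace that with real HTTP calls against the container. All names here (`record`, `replay`, `legacy_app`, `broken_rewrite`) are hypothetical, made up for illustration.

```python
import json
from typing import Callable, List, Tuple

Request = Tuple[str, str, str]               # (method, path, body)
Response = Tuple[int, str]                   # (status, body)
Handler = Callable[[str, str, str], Response]


def record(system: Handler, requests: List[Request]) -> str:
    """Run every request against the legacy system and save what it returned."""
    cases = []
    for method, path, body in requests:
        status, out = system(method, path, body)
        cases.append({"request": [method, path, body], "expected": [status, out]})
    return json.dumps(cases, indent=2)


def replay(system: Handler, recorded: str) -> List[dict]:
    """Run the recorded requests against a rewrite and return the mismatches."""
    failures = []
    for case in json.loads(recorded):
        method, path, body = case["request"]
        actual = list(system(method, path, body))
        if actual != case["expected"]:
            failures.append({**case, "actual": actual})
    return failures


def legacy_app(method: str, path: str, body: str) -> Response:
    # Stand-in for the black-box legacy system.
    if method == "GET" and path == "/health":
        return 200, "ok"
    if method == "POST" and path == "/echo":
        return 200, body
    return 404, "not found"


def broken_rewrite(method: str, path: str, body: str) -> Response:
    # Stand-in for a rewrite that forgot the /echo endpoint.
    if path == "/health":
        return 200, "ok"
    return 404, "not found"


if __name__ == "__main__":
    golden = record(legacy_app, [("GET", "/health", ""),
                                 ("POST", "/echo", "hi"),
                                 ("GET", "/missing", "")])
    print(replay(legacy_app, golden))      # the oracle agrees with itself
    print(replay(broken_rewrite, golden))  # reports the /echo mismatch
```

The recorded file is the part to keep read-only for the AI: it may read the expectations, but only the rewrite is allowed to change.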

I've been very successful so far using Sonnet 4.6 (1M) as the basic model in Claude Code, plus Codex and gemini-review plugins for second/third opinions. (The last one is somewhat busted and hardcodes old Gemini versions; I should patch it up.)

I needed to use Opus 4.7 for one project because it used very recent APIs, and it certainly is smart but it's also very expensive.

Normal engineering practices as taught since the '70s.

Break the problems up into manageable pieces. Make a plan, have tests to verify the outcome, implement that part. Rinse and repeat. Have integration tests.

I have an approach that can handle this if you're interested? My email is in my profile.

The problem with these types of benchmarks is that it’s 100% certain the LLM has been trained on all that code already, so they’re all tainted, since you don’t know whether it’s benchmarking recall or actual reasoning.

Same with SWE-bench and others.

I agree it's a potentially big problem, affecting almost any benchmark out there. We discuss it briefly in "Appendix A: Contamination and memorization" https://epoch.ai/blog/mirrorcode-preliminary-results#appendi....

Ideally one would do these benchmarks with held-out proprietary software, but that comes with many practical concerns.

That's a feature not a bug. It doesn't make benchmarking any more meaningful or simple, but being trained to recall patterns is a legitimate goal for a coding agent.

Yes but then the benchmarks need to be presented as "this verifies whether the model can recall this exact same situation and does not actually benchmark any reasoning at all".

This is not the case, they're being presented as "how good is the model at software engineering". E.g. the benchmark in question says this:

"Such settings require models to make high-level software architecture decisions. However, existing benchmarks measure focused, limited tasks such as fixing a single bug or developing a single, specified feature. We therefore introduce ProgramBench to measure the ability of software engineering agents to develop software holistically."

But the benchmark is fundamentally embedded extremely well in the training data, such that you're actually just benchmarking "how well do you remember what sqlite looks like" rather than "do you understand all the tradeoffs, risks, and design decisions that need to be made to build a bespoke database from scratch".

This is a VERY big caveat that, to me, goes a long way toward explaining the discrepancy between benchmarks and reality.

Is anyone familiar with gotree? That was mentioned as the most complex piece of code, but the metric was LOC. Based on the high level description gotree might be closer to a set of small programs / algorithms.

Interesting anyway. It will be nice to see these comparisons with open weight models and how those fare.

There's a more detailed description in "Appendix B: Qualitative discussion of the gotree task"

https://epoch.ai/blog/mirrorcode-preliminary-results#appendi...

I should say one big difference is ProgramBench has 200 target programs while MirrorCode has about 30. We did many manual things to ensure task quality, which would have required huge resources to do at ProgramBench scale.

"But I suspect that the ProgramBench authors are either under-eliciting the AIs, or their tasks are unfair/impossible given the constraints, or both."

I'd go with "impossible":

"Given a gold (reference) executable and its usage documentation, a task worker is asked to write source code and a build script that constructs a candidate executable which should reproduce the behavior of the gold executable."

The test cases are built by an AI that examines the source code, and later text confirms that the AI in the production phase can't read the original executable, so it can't reverse engineer it directly. The test cases are therefore drawn from a situation where the tester has vastly more knowledge of the program than the implementer.

That is a losing scenario for anyone, be they human, modern AI, or even some hypothetical perfect programmer. Take ffmpeg as an extreme example. The documentation does not even remotely specify the program. Entire codecs can be missed at a stroke, and each of those codecs is itself a rich set of features that may or may not be used in a given input or output file, but the final tests can freely draw from any of those things. And trying to implement a codec from just some input and output would strain anyone, especially when the input is all but certain to not be sufficiently broad to make the determination for sure.

That sort of issue extends all the way down to even some tiny command-line programs I've written myself. The end-user documentation is never a specification. That's not what end-user documentation is. And even if you did hand the AI all relevant specifications you'd still get an implementation of the specification, but anyone who has ever implemented a non-trivial specification into real-world situations can tell you all about how even the spec is never enough.

I think that's an absolutely ridiculous test. If you handed it to me as a human I would simply refuse, because I'd tell you straight up front that it is plainly obvious I'm going to utterly and completely fail, so why even bother with the time to try?