by thomashop 46 days ago
I thought the agent could run the real ffmpeg to compare outputs.
2 comments

I think you underestimate the complexity of audio and video encoding standards. There are hundreds and hundreds of pages of specification. How many times would you need to run the real ffmpeg to capture every tiny detail?

It's certainly possible to reverse-engineer it from black-box access, but it would take *years*, and this test has a time limit.

ffmpeg also includes many formats that have no published standard and were reverse-engineered in the first place.

Even given that, I think solving the problem would require a certain amount of personal agency and volition to drive useful experimentation, and you would still have an inescapable problem: a design process is never verifiably done. It’s just a sense of taste that tells you when a product is good enough and it’s time to stop working on it.

I’m not sure this benchmark is even very interesting, because it requires a language model to do something that it really cannot do. Maybe it would be possible with a novel harness in an ensemble system, but I would never expect a pure language model run in a minimal harness to be able to do this.