| HN Mirror



	by thomashop 46 days ago
	I thought the agent could execute the real ffmpeg to compare its output against.

2 comments

killerstorm 46 days ago

I think you underestimate the complexity of audio & video encoding standards. There are hundreds and hundreds of pages of specification. How many times do you need to execute the real ffmpeg to get all the tiny details?

It's certainly possible to reverse-engineer it from black-box access, but it would take *years* and this test has a time limit.


astrange 46 days ago

ffmpeg also includes many formats with no standards that were reverse-engineered in the first place.


GorbachevyChase 46 days ago

Even given that, I think solving the problem would require a certain amount of personal agency and volition to drive useful experimentation. And then you still have an inescapable problem: a design process is never verifiably done. It's just a sense of taste that tells you when a product is good enough and it's time to stop working on it.

I’m not sure this benchmark is even very interesting, because it requires a language model to do something that it really cannot do. Maybe it would be possible with a novel harness in an ensemble system, but I would never expect a pure language model run in a minimal harness to be able to do this.
