HN Mirror



	by yorwba 39 days ago
	The METR task set contains no tasks with a duration greater than 32 hours (conservatively eyeballed from Figure 3: https://arxiv.org/abs/2503.17354 ), so any prediction that naively forecasts a longer time horizon is trivially incorrect. I guess that won't lead to a sigmoid-looking graph though, since METR will likely switch to a different evaluation methodology at that point and stop updating the old curve.

1 comment

METR themselves say that any time-horizon estimate above 16 hours is highly suspect because there are too few tasks of that length.

I expect benchmarks like ProgramBench will replace METR this year.
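
For what it's worth, the two cutoffs mentioned in this thread (roughly 32 hours as the longest task, 16 hours as the point past which estimates get shaky) can be turned into a quick sanity check on any forecast. This is only a sketch: the constant names, the function, and the three labels are made up for illustration, and the 32-hour figure is the eyeballed value from the parent comment, not an official number.

```python
# Thresholds taken from the thread; both are rough, not official METR values.
MAX_TASK_HOURS = 32.0   # longest task in the set, eyeballed from Figure 3 (assumption)
RELIABLE_HOURS = 16.0   # above this, estimates are described as highly suspect


def classify_horizon(hours: float) -> str:
    """Label a forecast time horizon by how well the task set can support it.

    Hypothetical helper for illustration only.
    """
    if hours < 0:
        raise ValueError("time horizon must be non-negative")
    if hours > MAX_TASK_HOURS:
        # No task in the set is this long, so the claim cannot be measured.
        return "out_of_range"
    if hours > RELIABLE_HOURS:
        # Measurable in principle, but backed by very few tasks.
        return "suspect"
    return "supported"


if __name__ == "__main__":
    for h in (8, 24, 40):
        print(h, classify_horizon(h))
```

Anything labelled "out_of_range" is the case the parent comment describes: the forecast exceeds what the current task set can measure at all.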