by yorwba
39 days ago
The METR task set contains no tasks with a duration greater than 32 hours (conservatively eyeballed from Figure 3: https://arxiv.org/abs/2503.17354 ), so any prediction that naively forecasts a longer time horizon is trivially incorrect. I guess that won't lead to a sigmoid-looking graph though, since METR will likely switch to a different evaluation methodology at that point and stop updating the old curve.

I expect benchmarks like ProgramBench will replace METR this year.