by zamadatix
268 days ago
That (thankfully) can't compound, so it would never be more than a one-time offset. E.g. if you report a score of 60% on SWE-bench Verified for new model A, dumb A down to score 50%, and then report a 20% improvement over A with new model B, it's pretty obvious what happened when your last two model blog posts both say 60%. The only way around this is to never report on the same benchmark version twice, which isn't realistic given how many benchmarks they include in every release.
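To make the arithmetic concrete, here's a rough sketch using the numbers from the example. It assumes the "20% improvement" is a relative gain over the degraded score; the function and variable names are made up for illustration.

```python
def implied_score(baseline_pct: float, relative_gain: float) -> float:
    """Absolute score implied by a relative gain over a baseline (assumed relative, not percentage points)."""
    return baseline_pct * (1.0 + relative_gain)


published_a = 60.0   # score originally reported for model A
degraded_a = 50.0    # A's score after being quietly dumbed down
claimed_gain = 0.20  # "20% improvement over A" claimed for model B

# Absolute score B would have to post to back up the claim.
b_score = implied_score(degraded_a, claimed_gain)

# Gain relative to what was actually published for A last time.
gain_vs_published = b_score / published_a - 1.0

print(f"B: {b_score:.1f}%, gain vs. published A: {gain_vs_published:+.1%}")
```

Since B's absolute number lands right back on A's published number, anyone comparing the two blog posts sees no real improvement, and the trick can't be stacked release after release.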