| It's pretty clear that any benchmark that comes out will be outdated and end up in the training data in short order. There will always be an incentive to optimize specifically for these benchmarks, even if just for marketing material. Sure, there is a training cutoff, but it's usually only 3-6 months before the public release date. The problem with coding benchmarks then becomes creating novel benchmarks that are guaranteed not to be in the training data already and that borrow nothing from previous benchmarks. In this regard, I don't think any benchmark that was created before a given model is released should ever be considered valid or representative of model performance. The potential financial gain from including the data just to be able to market a minor improvement is too tempting. With that in mind, they should honestly just stop including benchmarks in marketing material altogether. Let the model speak for itself and let the community decide, but of course that will never fly with corporate types with so much money on the line. |
https://github.com/mnky9800n/zork-bench