|
|
|
|
|
by bogtog
512 days ago
|
|
A lot of the comments express some type of deliberate cheating the benchmark. However, even without intentionally trying to game it, if anybody can repeatedly take the same test, then they'll be nudged to overfit/p-hack. For instance, suppose they conduct an experiment and find that changing some hyper-parameter yields a 2% boost. That could just be noise, it could be a genuine small improvement, or it may be a mix of a genuine boost along with some fortunate noise. An effect may be small enough that researchers would need to rely on their gut to interpret it. Researchers may jump on noise while believing they have discovered true optimizations. Enough of these types of nudges, and some serious benchmark gains can materialize. (Hopefully my comment isn't entirely misguided, I don't know how they actually do testing or how often they probe their test set) |
|