by al_borland
64 days ago
Are there any good ways to benchmark models over time that don't fall victim to Goodhart's law? It seems that once a benchmark is defined, the AI will train on it, and it will become effectively meaningless. I've read many articles about AIs doing extremely well on various tests in graduate or PhD level programs. But these tests are well defined. A professor put the same models through his freshman CS class and most of them failed.