by famouswaffles
79 days ago
>"Making up for" a poor score on one test with an excellent score on another would be the opposite of generality.

Really? This happens plenty with human testing. Humans aren't general? The score is convoluted and messy. If the same score can say materially different things about capability, then that's a bad scoring methodology. I can't believe I have to spell this out, but it seems critical thinking goes out the window when we start talking about machine capabilities.
|
Apparently someone here doesn't know how outliers affect a mean or, for that matter, have any clue about the purpose of the ARC-AGI benchmark.
For anyone who is interested in critical thinking, this paper describes the original motivation behind the ARC benchmarks:
https://arxiv.org/abs/1911.01547