by daveguy
82 days ago
The purpose is to benchmark both generality and intelligence. "Making up for" a poor score on one test with an excellent score on another would be the opposite of generality. There's a ceiling based on how consistent the performance is across all tasks.
|
Really? This happens plenty with human testing. Humans aren't general?
The score is convoluted and messy. If the same score can say materially different things about capability, then that's a bad scoring methodology.
I can't believe I have to spell this out, but it seems critical thinking goes out the window when we start talking about machine capabilities.
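
To make the disagreement concrete, here is a rough sketch of how the choice of aggregate changes what a single number says. The per-task scores are made up for illustration, and none of these aggregates is claimed to be the benchmark's actual formula. A plain average lets one strong task offset a weak one, while a harmonic mean or a worst-task score penalizes uneven performance.

```python
from statistics import fmean, harmonic_mean

# Hypothetical per-task scores (percent); not taken from any real benchmark.
uneven = [90, 90, 10, 10]   # excellent on two tasks, poor on two
even = [50, 50, 50, 50]     # the same level on every task


def summarize(scores):
    """Return three candidate aggregates for a list of per-task scores."""
    return {
        "mean": fmean(scores),              # strong tasks can offset weak ones
        "harmonic": harmonic_mean(scores),  # pulled down by inconsistent results
        "worst": min(scores),               # capped by the weakest task
    }


for name, scores in (("uneven", uneven), ("even", even)):
    print(name, summarize(scores))
```

Both profiles get the same arithmetic mean, so a plain average cannot tell them apart. The other two aggregates rank the even profile higher, which is one way to read the "ceiling based on consistency" idea above.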