| Scoring 96 percentile among humans taking the exam without moving goal posts would have been science fiction two years ago. Now it’s suddenly not good enough and the fact a computer program can score decent among passing lawyers and first time test takers is something to sneer at. The fact I can talk to the computer and it responds to me idiomatically and understands my semantic intent well enough to be nearly indistinguishable from a human being is breath taking. Anyone who views it as anything less in 2024 and asserts with a straight face they wouldn’t have said the same thing in 2020 is lying. I do however find the paper really useful in contextualizing the scoring with a much finer grain. Personally I didn’t take the 96 percentile score to be anything other than “among the mass who take the test,” and have enough experience with professional licensing exams to know a huge percentage of test takers fail and are repeat test takers. Placing the goal posts quantitatively for the next levels of achievement is a useful exercise. But the profusion of jaded nerds makes me sad. |
Are we sure these exams are not present in the training data? (ability to recall information is not impressive for a computer)
Still I'm terrible at many many tasks e.g., drawing from description and the models widen significantly types of problems that I can even try (where results can be verified easily, and no precision is required)