Hacker News
by jpdus 954 days ago
For other (non-code) benchmarks, people are having the opposite experience:

"I benchmarked on SAT reading, which is a nice human reference for reasoning ability. Took 3 sections (67 questions) from an official 2008-2009 test (2400 scale) and got the following results, here a SAT-like test:

- GPT3.5 - 690 (10 wrong)
- GPT4 - 770 (3 wrong)
- GPT4-turbo (one section at time) - 740 (5 wrong)
- GPT4-turbo (3 sections at once, 9K tokens) - 730 (6 wrong)"

Source: https://twitter.com/wangzjeff/status/1721934560919994823?t=P...

4 comments

Does anybody know if the 2008-2009 SAT is in the training set for these models? Assuming so, I’d be especially interested in head-to-head evals on this type of non-code benchmark using problem sets that are not already in the training data, to see how the models perform on fresh questions.
Probably not a statistically significant difference there.
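One rough way to check that claim is a two-sided Fisher's exact test on the wrong-answer counts quoted above (3 of 67 for GPT4, 5 of 67 for GPT4-turbo one section at a time). This is only a sketch under stated assumptions: the function name `fisher_exact_two_sided` is made up for illustration, and the test treats the two runs as independent samples. Both models answered the same 67 questions, so a paired test such as McNemar's would fit better, but the per-question results are not given in the post.

```python
from math import comb


def fisher_exact_two_sided(wrong_a, total_a, wrong_b, total_b):
    """Two-sided Fisher's exact test for a 2x2 table of wrong/right counts.

    Hypothetical helper for illustration. Returns the probability, under the
    null hypothesis that both groups share one error rate, of a split of the
    wrong answers at least as unlikely as the observed one.
    """
    n_total = total_a + total_b      # all questions across both runs
    k_total = wrong_a + wrong_b      # all wrong answers across both runs

    def weight(k):
        # Number of ways group A ends up with exactly k of the wrong answers.
        return comb(k_total, k) * comb(n_total - k_total, total_a - k)

    lowest = max(0, k_total - total_b)
    highest = min(k_total, total_a)
    observed = weight(wrong_a)

    # Integer arithmetic keeps the "at least as unlikely" comparison exact.
    tail = sum(weight(k) for k in range(lowest, highest + 1)
               if weight(k) <= observed)
    return tail / comb(n_total, total_a)


# Counts taken from the quoted results: GPT4 vs GPT4-turbo (one section at a time).
p_value = fisher_exact_two_sided(3, 67, 5, 67)
print(f"p = {p_value:.3f}")
```

A p-value far above 0.05 here would be consistent with the comment: a gap of two questions out of 67 is well within what chance alone produces.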
N=1
What did you mean by "opposite"?

You seem to be suggesting it got a bit worse. The Aider article also seems to suggest GPT4 got a bit worse (although much faster at being a bit worse), while GPT3.5 got worse, then better, and faster as well.

The Aider article has been updated with the complete results. Previously Turbo was leading slightly. So far any difference is in the noise.

However, in my opinion the first attempt score is more important, and Turbo does genuinely seem to lead there. There's still a possibility the updated training data has tainted the results.