| HN Mirror

by jpdus 954 days ago

For other (non-code) benchmarks, people are having the opposite experience:

"I benchmarked on SAT reading, which is a nice human reference for reasoning ability. Took 3 sections (67 questions) from an official 2008-2009 test (2400 scale) and got the following results, here a SAT-like test:

- GPT3.5 - 690 (10 wrong)
- GPT4 - 770 (3 wrong)
- GPT4-turbo (one section at time) - 740 (5 wrong)
- GPT4-turbo (3 sections at once, 9K tokens) - 730 (6 wrong)"

Source: https://twitter.com/wangzjeff/status/1721934560919994823?t=P...

4 comments

dazzaji 954 days ago

Does anybody know if the 2008-2009 SAT is in the training set for these models? Assuming so, I’d be especially interested in head-to-head evals on this type of non-code benchmark using problem sets not already in the training data, to see how the models perform on fresh situations.

rafaelero 954 days ago

Probably not a statistically significant difference there.
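
A rough way to check this is a two-sided Fisher exact test on the wrong/right counts quoted above (3 wrong vs. 6 wrong out of 67 questions each). This is only a sketch: it assumes every question is an independent trial, which is not strictly true for SAT reading passages, and the function name `fisher_exact_two_sided` is made up for illustration.

```python
from fractions import Fraction
from math import comb


def fisher_exact_two_sided(wrong_a, total_a, wrong_b, total_b):
    """Two-sided Fisher exact p-value for a 2x2 table of wrong/right counts."""
    total = total_a + total_b
    wrong = wrong_a + wrong_b
    denom = comb(total, total_a)

    def prob(k):
        # Hypergeometric probability that model A accounts for exactly k
        # of the pooled wrong answers.
        return Fraction(comb(wrong, k) * comb(total - wrong, total_a - k), denom)

    lo = max(0, wrong - total_b)
    hi = min(wrong, total_a)
    observed = prob(wrong_a)
    # Sum the probability of every table at least as unlikely as the observed one.
    return float(sum(p for p in map(prob, range(lo, hi + 1)) if p <= observed))


QUESTIONS = 67  # three sections, as reported in the quoted tweet

# GPT4 (3 wrong) vs. GPT4-turbo with 3 sections at once (6 wrong)
p = fisher_exact_two_sided(3, QUESTIONS, 6, QUESTIONS)
print(f"p = {p:.2f}")
```

Exact fractions are used so the "at least as unlikely" comparison is not affected by floating-point rounding. A p-value well above 0.05 would be consistent with the gap being noise at this sample size.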

exo-pla-net 954 days ago

N=1

Terretta 954 days ago

What did you mean by "opposite"?

You seem to be suggesting it got a bit worse. The aider article also seems to suggest gpt4 got a bit worse (although much faster at being a bit worse), while gpt3.5 got worse, then better, while also getting faster.

reitzensteinm 954 days ago

The Aider article has been updated with the complete results. Previously Turbo was leading slightly. So far any difference is in the noise.

However, in my opinion the first attempt score is more important, and Turbo does genuinely seem to lead there. There's still a possibility the updated training data has tainted the results.
