by jpdus
954 days ago
For other (non-code) benchmarks, people are having the opposite experience:

"I benchmarked on SAT reading, which is a nice human reference for reasoning ability. Took 3 sections (67 questions) from an official 2008-2009 test (2400 scale) and got the following results, here a SAT-like test:
- GPT3.5 - 690 (10 wrong)
- GPT4 - 770 (3 wrong)
- GPT4-turbo (one section at time) - 740 (5 wrong)
- GPT4-turbo (3 sections at once, 9K tokens) - 730 (6 wrong)"

Source: https://twitter.com/wangzjeff/status/1721934560919994823?t=P...