|
|
|
|
|
by vishal0123
1192 days ago
|
|
From the paper > We tested GPT-4 on a diverse set of benchmarks, including simulating exams that were originally designed for humans.3 We did no specific training for these exams. A minority of the problems in the exams were seen by the model during training; for each exam we run a variant with these questions
removed and report the lower score of the two. We believe the results to be representative. For further details on contamination (methodology and per-exam statistics), see Appendix C. |
|