by gruez
593 days ago
>That study very clearly shows 4o, o1-mini and o1-preview's new scores are all within margin error on 4/5 of their new benchmarks.

Which figure are you referring to? For instance, figure 8a shows a -32.0% accuracy drop when an insignificant change was added to the question. It's unclear how that's "within the margin of error" or "Changing names does not affect the performance of Sota models".