Hacker News
by raincole 338 days ago
I don't know the details (of course, it's unreleased), but note that MathArena evaluated "average of 4 attempts" and limited token usage to 64k.
OpenAI likely had unlimited tokens and evaluated "best of N attempts."
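To make the gap between the two scoring rules concrete, here is a minimal sketch. The attempt outcomes are made up for illustration, the function names are hypothetical, and whether OpenAI actually scored best-of-N is only the guess above, not something confirmed.

```python
def average_of_k(attempts):
    """Fraction of attempts that were correct (the "average of k" rule)."""
    return sum(attempts) / len(attempts)


def best_of_n(attempts):
    """1.0 if any single attempt was correct, else 0.0 (the "best of N" rule)."""
    return 1.0 if any(attempts) else 0.0


# Hypothetical outcomes for one problem: True means that attempt was correct.
attempts = [False, True, False, False]

print("average of 4:", average_of_k(attempts))
print("best of 4:", best_of_n(attempts))
```

The same four attempts score very differently under the two rules, so numbers reported under one are not directly comparable to numbers reported under the other.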