Hacker News
by winton, 143 days ago
So if I try to do it with Opus three or four times, I'll get it done? And probably in about 10 minutes? Awesome
2 comments
stared, 143 days ago
Nope, these are not random dice rolls. Some tasks are solved on every run, a few only occasionally (for those it would be meaningful to try a few times, and the pass@1 and pass@3 metrics would differ), but most are never solved.
See e.g.:
https://quesma.com/benchmarks/otel/models/claude-opus-4.5/
throwup238, 143 days ago
That’s only if the failures are truly random and aren’t correlated
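
To put rough numbers on the point about retries: here is a minimal sketch, not taken from the linked benchmark, that compares pass@1 and pass@3 on made-up per-task results with what you would expect if every attempt were an independent dice roll at the average success rate. The `successes` list is hypothetical data, and `pass_at_k` uses the standard combinatorial estimator 1 - C(n-c, k) / C(n, k).

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k runs, drawn from n recorded runs
    with c successes, solves the task."""
    if n - c < k:
        # Fewer than k failures recorded, so any k runs include a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical results (NOT benchmark data): successes out of 10 runs per task.
# Two tasks are always solved, one occasionally, seven never.
n = 10
successes = [10, 10, 3, 0, 0, 0, 0, 0, 0, 0]

pass1 = sum(pass_at_k(n, c, 1) for c in successes) / len(successes)
pass3 = sum(pass_at_k(n, c, 3) for c in successes) / len(successes)

# What pass@3 would be if every attempt were an independent roll
# at the average pass@1 rate, regardless of task.
naive3 = 1.0 - (1.0 - pass1) ** 3

print(f"pass@1 = {pass1:.3f}")
print(f"pass@3 = {pass3:.3f}")
print(f"independent-rolls estimate = {naive3:.3f}")
```

With this made-up data, pass@1 is 0.23 and pass@3 only rises to about 0.27, while the independent-rolls estimate would be about 0.54. Retrying helps only on the one occasionally-solved task; the never-solved tasks stay at zero no matter how many attempts are made.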