Hacker News
by stared, 144 days ago
Nope, these are not random dice rolls. Some tasks are solved on every run, a few only occasionally (for those it would be meaningful to try several times, and pass@1 and pass@3 would differ), but most are never solved.
See e.g.:
https://quesma.com/benchmarks/otel/models/claude-opus-4.5/
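To make the pass@1 vs pass@3 point concrete, here is a minimal Python sketch using the commonly used unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), for a task run n times with c successful runs. The task names and run counts are hypothetical, not taken from the linked benchmark. Tasks that are always or never solved get the same score for any k; only the occasionally solved one moves.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n runs, c of them successful."""
    if n - c < k:
        # Every size-k subset of the runs contains at least one success.
        return 1.0
    # Probability that a random size-k subset of runs contains no success.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-task results as (runs, successes); not real benchmark data.
tasks = {
    "always_solved": (5, 5),
    "occasionally_solved": (5, 1),
    "never_solved": (5, 0),
}

for name, (n, c) in tasks.items():
    print(f"{name}: pass@1={pass_at_k(n, c, 1):.2f} pass@3={pass_at_k(n, c, 3):.2f}")
# always_solved: pass@1=1.00 pass@3=1.00
# occasionally_solved: pass@1=0.20 pass@3=0.60
# never_solved: pass@1=0.00 pass@3=0.00
```

So if most tasks sit in the "always" or "never" buckets, extra attempts barely change the benchmark score; the gap between pass@1 and pass@3 comes only from the flaky middle.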