by languid-photic
137 days ago
Good point. This post measures `1x top-N` (one attempt each from N models), not `Nx top-1` (N attempts from the best-scoring model). We should make that clearer. Part of why we chose `1x top-N` is that we expect its errors to be less correlated than those of `Nx top-1`, which is also why the iid baseline is likely optimistic. That said, a direct comparison (`Nx top-1` vs `1x top-N`, with the same review/compute budget) would be useful!
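
To make the "iid baseline is likely optimistic" point concrete, here is a rough sketch of a toy model, not the calculation from the post. It assumes every attempt has the same marginal success probability `p` (which real top-N models would not share exactly), and it uses a made-up coupling parameter `rho`: with probability `rho` all N attempts share a single outcome, otherwise they are independent. The function names are hypothetical too.

```python
def p_any_success_iid(p: float, n: int) -> float:
    """Chance that at least one of n independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** n


def p_any_success_correlated(p: float, n: int, rho: float) -> float:
    """Toy mixture model (an assumption, not the post's method).

    With probability rho all n attempts share one outcome, so the group
    succeeds with probability p. Otherwise the attempts are independent.
    Each attempt still succeeds with marginal probability p.
    """
    return rho * p + (1.0 - rho) * p_any_success_iid(p, n)


if __name__ == "__main__":
    p, n = 0.5, 3  # illustrative values only
    for rho in (0.0, 0.5, 1.0):
        print(rho, p_any_success_correlated(p, n, rho))
```

Under this toy model, any `rho > 0` with `n > 1` and `0 < p < 1` gives a lower chance of at least one success than the iid figure, which is the sense in which the iid baseline is an upper bound here. Whether `Nx top-1` actually has a higher effective `rho` than `1x top-N` is exactly what the direct comparison would need to measure.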
I would like to continue the likelihood calculation.