by px1999
557 days ago
IMO the con is picking the metric that makes the others look artificially bad when it doesn't seem to be all that different (at least on the surface):

> we use a stricter evaluation setting: a model is only considered to solve a question if it gets the answer right in four out of four attempts ("4/4 reliability"), not just one

This surely makes the other models post smaller numbers. I'd be curious how it stacks up when scored on, e.g., 1/1 attempt or 1/4 attempts.
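To put rough numbers on how much the scoring rule alone moves the result, here is a back-of-the-envelope sketch. It assumes each attempt is independent and succeeds with the same probability `p`, which is a simplification and not something the quoted evaluation states; `solve_rates` and the sample `p` values are made up for illustration.

```python
def solve_rates(p: float, k: int = 4) -> dict[str, float]:
    """Expected per-question solve rate under three scoring rules.

    Assumption (hypothetical): k independent attempts, each correct
    with the same probability p.
    """
    return {
        "1/1": p,                    # a single attempt must be right
        f"{k}/{k}": p ** k,          # every one of k attempts must be right
        f"1/{k}": 1 - (1 - p) ** k,  # at least one of k attempts must be right
    }


# Illustrative per-attempt accuracies, not measured results.
for p in (0.5, 0.8, 0.95):
    rates = solve_rates(p)
    print(p, {name: round(rate, 3) for name, rate in rates.items()})
```

Under that assumption a model that is right half the time per attempt scores 50% at 1/1 but only about 6% at 4/4, so the stricter rule shrinks every model's number and hits the less consistent ones hardest.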