by nikcub
146 days ago
> I would run on 300 tasks and I'd run the test suite 5 or 10 times per day and average that score.

I assume this is because of model costs. Anthropic could either throw some credits their way (which would be worthwhile to dispel the 80 Reddit posts a day about degrading models and quantization), or OP could put up a donation / tip link.

E.g. some binomial proportion intervals (aka confidence intervals).
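To make that concrete, here is a rough sketch of one common binomial proportion interval, the Wilson score interval, applied to a pass rate on a 300-task run. The function name `wilson_interval` and the pass count of 150 are made up for illustration; only the 300-task figure comes from the quote above, and other intervals (Clopper-Pearson, Agresti-Coull) would be equally reasonable choices.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to roughly 95% confidence.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")

    p_hat = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    # The interval is centered on a value pulled slightly toward 0.5.
    center = (p_hat + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z2 / (4 * trials * trials)
    ) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # Hypothetical result: 150 of 300 tasks passed in a single run.
    low, high = wilson_interval(150, 300)
    print(f"pass rate 50.0%, 95% interval: {low:.1%} to {high:.1%}")
```

With a few hundred tasks the interval is several percentage points wide on each side, so day-to-day differences smaller than that are hard to tell apart from noise on a single run.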