by 7777777phil
164 days ago
Cool project. So far I feel like I have been running my own mental, gut-feeling degradation tracker.
- Is a daily sample size of 50 really sufficient to distinguish actual model degradation from the inherent stochasticity of SWE-bench?
- Since you're running directly in the CLI, do you also track 'silent' degradations like increased token usage or latency, or is it strictly pass/fail?
- What are the API costs for daily Opus 4.5 runs on 50 SWE tasks?
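
To put a rough number on the sample-size question, here is a minimal sketch of the uncertainty around a single day's pass rate. It assumes each of the 50 tasks is an independent pass/fail trial, which is a simplification, and the 35/50 figure is purely hypothetical, not a result from the project. `pass_rate_interval` is an illustrative helper name, not part of any tool mentioned here.

```python
import math


def pass_rate_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate (z=1.96 is roughly 95%)."""
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Hypothetical day: 35 of 50 tasks pass (made-up numbers for illustration).
low, high = pass_rate_interval(passes=35, n=50)
print(f"observed 70%, plausible range {low:.0%} to {high:.0%}")
```

Under these assumptions the interval for one day spans well over twenty percentage points, so a single-day dip of a few points would be hard to tell apart from noise; pooling several days or comparing against a rolling baseline would narrow it.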
Regarding more subtle degradation tracking, it is on the roadmap.