| HN Mirror



	by 7777777phil 164 days ago
	Cool project, I feel like I have been running my own mental, gut feeling degeneration tracker so far.
	- Is a daily sample size of 50 really sufficient to distinguish actual model degradation from the inherent stochasticity of SWE-bench?
	- Since you're running directly in the CLI, do you also track 'silent' degradations like increased token usage or latency, or is it strictly pass/fail?
	- What are the API costs for daily Opus 4.5 runs on 50 SWE tasks?

1 comments

qwesr123 164 days ago

Thanks! Daily confidence intervals are quite large and not super useful at the moment. Weekly aggregation is more sensitive. We're hoping to increase sample sizes, but it is quite expensive! It would be about $100-$150/day in API costs. We are using the Pro x20 subscription ($200/month) instead.

Regarding more subtle degradation tracking, it is on the roadmap.
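
To put rough numbers on the daily-versus-weekly point, here is a minimal sketch of a 95% Wilson score interval for a pass rate. It is not the tracker's actual method, which the thread does not describe. The 70% pass rate is a made-up placeholder, the weekly figure assumes seven daily runs of 50 tasks pooled together, and `wilson_interval` is a hypothetical helper name.

```python
import math


def wilson_interval(passes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= passes <= n:
        raise ValueError("passes must be between 0 and n")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical pass rate, chosen only for illustration (not from the thread).
ASSUMED_PASS_RATE = 0.70

# 50 tasks per day is from the thread; 350 assumes 7 pooled daily runs.
for label, n in (("daily", 50), ("weekly", 350)):
    passes = round(ASSUMED_PASS_RATE * n)
    low, high = wilson_interval(passes, n)
    print(f"{label:>6}: n={n:3d}  95% CI = [{low:.3f}, {high:.3f}]  "
          f"width = {high - low:.3f}")
```

Under these assumptions the daily interval spans roughly 12 percentage points either side of the observed rate, while the pooled weekly one is closer to 5, so a drop of a few points would be hard to tell apart from noise on any single day.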


7777777phil 164 days ago

Cheers! Great work! Let me know if there's a way to follow the development.
