|
|
|
|
|
by languid-photic
48 days ago
|
|
Yes, the signal we are measuring is quite different from most evals. We are measuring sth much closer to: when multiple agents compete on the same spec, which one produces the patch that holds up best in code review? Most evals are static / synthetic, and for code, generally stop at tests. Test evals are weak proxies for quality since it's difficult to encode qualities like scope creep/churn, codebase fit, maintainability etc in tests. [1] Almost every agent in a given run can pass tests at this point, but there is large separation during review. [1] https://voratiq.com/blog/your-workflow-is-the-eval |
|