Hacker News
by stingraycharles 44 days ago
That’s probably more work than the entire repo itself. It would need to be something like SWE-bench, run with and without “adamsreview”.
You’re right, though; evals are actually fairly tricky to write and maintain.