Hacker News
user: bisonbear
created: 2025-09-17
karma: 43

Building evals for AI coding agents, on your repo. Tests pass. Nobody's measuring the rest. http://stet.sh email ben@stet.sh

submissions:

I benchmarked Opus 4.8 vs. GPT 5.5 on 2 open source repos
3 points | 0 comments
I used autoresearch to improve my AGENTS.md, measured against real tasks
8 points | 7 comments
A brief investigation into the GPT-5.5 regression claims
1 point | 0 comments
The Opus 4.7 reasoning curve - Medium is the best default?
1 point | 0 comments
GPT-5.5 low vs. medium vs. high vs. xhigh: the reasoning curve on 26 real tasks
2 points | 0 comments
GPT-5.5 vs. GPT-5.4 vs. Opus 4.7 on 56 real coding tasks from 2 open source repo
4 points | 0 comments
I ran Opus 4.7 vs. Old Opus 4.6 vs. New Opus 4.6 on 28 Zod tasks
2 points | 0 comments
Coding evals are broken. CI is green while AI code quality goes unmeasured
1 point | 0 comments
Agents.md is the highest-leverage code you're not testing
1 point | 0 comments