Hacker News
user:
bisonbear
created:
2025-09-17
karma:
43
Building evals for AI coding agents, on your repo. Tests pass. Nobody's measuring the rest. http://stet.sh email ben@stet.sh
submissions:
I benchmarked Opus 4.8 vs. GPT 5.5 on 2 open source repos
3 points
|
0 comments
I used autoresearch to improve my AGENTS.md, measured against real tasks
8 points
|
7 comments
A brief investigation into the GPT-5.5 regression claims
1 point
|
0 comments
The Opus 4.7 reasoning curve - Medium is the best default?
1 point
|
0 comments
GPT-5.5 low vs. medium vs. high vs. xhigh: the reasoning curve on 26 real tasks
2 points
|
0 comments
GPT-5.5 vs. GPT-5.4 vs. Opus 4.7 on 56 real coding tasks from 2 open source repo
4 points
|
0 comments
I ran Opus 4.7 vs. Old Opus 4.6 vs. New Opus 4.6 on 28 Zod tasks
2 points
|
0 comments
Coding evals are broken. CI is green while AI code quality goes unmeasured
1 point
|
0 comments
Agents.md is the highest-leverage code you're not testing
1 point
|
0 comments