User: bisonbear | HN Mirror

user: bisonbear
created: 2025-09-17
karma: 45

Building evals for AI coding agents, on your repo. Tests pass. Nobody's measuring the rest. http://stet.sh email ben@stet.sh

submissions:

I compared 5 popular token saving methods in Codex and found that none delivered

2 points | 0 comments

0 points | 0 comments

I ran Sonnet 5 vs. Opus 4.8 head to head on 24 tasks to see what's different

1 point | 0 comments

0 points | 0 comments

I evaluated GLM 5.2 against the frontier on tasks from real repos

2 points | 2 comments

0 points | 0 comments

I benchmarked Opus 4.8 vs. GPT 5.5 on 2 open source repos

3 points | 0 comments

0 points | 0 comments

I used autoresearch to improve my AGENTS.md, measured against real tasks

8 points | 7 comments

A brief investigation into the GPT-5.5 regression claims

1 point | 0 comments

0 points | 0 comments

The Opus 4.7 reasoning curve - Medium is the best default?

1 point | 0 comments

0 points | 0 comments

GPT-5.5 low vs. medium vs. high vs. xhigh: the reasoning curve on 26 real tasks

2 points | 0 comments

GPT-5.5 vs. GPT-5.4 vs. Opus 4.7 on 56 real coding tasks from 2 open source repo

4 points | 0 comments

0 points | 0 comments