by eranation
6 days ago
gpt-5-mini and gpt-5.5 (had to tweak the code a bit to make it work). Surprisingly, not as big a difference as one would hope. It turns out that smarter models are more conservative: a smarter model or more thinking sometimes means slightly worse recall. I think that perhaps says more about the benchmark itself. Reviews are highly opinionated, and it could be that the smarter models are actually better and the “golden” state is just very opinionated.
|
I tested Coducky (my AI review macOS app) on the full 50-PR Martian benchmark, using qwen3.7-plus via OpenRouter as the reviewer and deepseek-v4-flash as a lightweight pre-save precision gate. The score (gpt 5.2 judge) was 43.0% precision / 35.8% recall / 39.0 F1. That puts it about in line with CodeRabbit. Running the full 50 PRs cost around $7.
Your post inspired me to set up a test harness for my app to continue to test model combinations. Coducky allows you to select whichever models/subscriptions you like to run reviews, but it could make sense to build a collection of model combinations that work well for this.
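
For anyone comparing these numbers across tools, here is a minimal sketch of how precision and recall combine into F1, assuming the benchmark uses the standard harmonic-mean definition (the thread does not say how the judge aggregates scores). The function name `f1_score` is just illustrative and not taken from the benchmark code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        # No true positives at all: define F1 as 0 instead of dividing by zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded figures quoted in the comment above (43.0% precision, 35.8% recall).
precision = 0.430
recall = 0.358

f1 = f1_score(precision, recall)
print(f"F1 = {f1:.1%}")
```

With the rounded inputs this lands at roughly 39.1%, within rounding of the reported 39.0. The small gap is presumably because the reported score was computed from unrounded values.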