
That must have been expensive! Thanks for running the benchmark and sharing.

I tested Coducky (my AI review app for macOS) on the full 50-PR Martian benchmark, using qwen3.7-plus via OpenRouter as the reviewer and deepseek-v4-flash as a lightweight pre-save precision gate. The score (gpt 5.2 as judge) was 43.0% precision / 35.8% recall / 39.0 F1, which puts it about in line with CodeRabbit. The full 50-PR run cost around $7.
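For anyone checking the numbers: F1 is the harmonic mean of precision and recall. Here is a minimal sketch of that arithmetic, assuming the benchmark uses the standard definition. The function name `f1_score` is my own for illustration, not something from the benchmark or from Coducky.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        # Avoid dividing by zero when both inputs are zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded figures quoted above: 43.0% precision, 35.8% recall.
f1 = f1_score(0.430, 0.358)
print(f"F1 = {100 * f1:.1f}")
```

Plugging in the rounded figures gives about 39.1, which matches the reported 39.0 to within rounding of the inputs.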

Your post inspired me to set up a test harness for my app so I can keep testing model combinations. Coducky lets you pick whichever models/subscriptions you like for running reviews, but it could make sense to build a collection of model combinations that are known to work well for this.