HN Mirror

by eranation 15 days ago

Ran it on a subset of 10 of the 50 PRs in this benchmark https://codereview.withmartian.com

- very good recall (~74%, i.e. it found a lot of the golden issues)

- not so good precision (~12%, i.e. lots of false positives)

- the low precision causes the F1 to tank (~20%; if this holds on the full 50-PR sample, it would put it almost last, even below Kilo+Grok)
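
For reference, the F1 figure follows from the harmonic mean of precision and recall. A minimal sketch using the approximate numbers above (the function name is only for illustration, and the inputs are the rounded percentages from this comment, not exact benchmark output):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded figures from the 10-PR subset: ~12% precision, ~74% recall.
f1 = f1_score(0.12, 0.74)
print(f"{f1:.1%}")  # 20.7%
```

Because the harmonic mean is dominated by the smaller value, the high recall barely lifts the score.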

lizhengfeng101 10 days ago

Thank you for sharing these results and for running the evaluation! We noticed that in the version tested, there was an anomaly in a critical tool call that significantly impacted the overall performance — particularly contributing to the high false positive rate you observed. We were able to reproduce the issue on the benchmark and have since fixed it. We appreciate you taking the time to highlight this, and we look forward to seeing how it performs on the full 50-sample evaluation!

akie 15 days ago

I would say that recall is the most important metric here though. I'd want it to catch all the issues.

False positives are easy to ignore.

witx 15 days ago

What? No, they're not. You still need to analyze them to determine that they are false positives. That's wasted time.

chaoz_ 15 days ago

Agreed. It's something that will eventually teach your developers to ignore the points raised, since they're mostly garbage.

onion2k 15 days ago

Finding problems is optimizing for the customer. Avoiding false positives is optimizing for the developer. Which is right depends on your org's culture.

evolve-maz 15 days ago

If I flag every line in your PR as a potential security bug then I have 100% recall.

Obviously you need a mixture of high recall and a low false positive rate. If 7/8 flagged items are fine, it's much more likely people will ignore the warnings, much like they would with any security tool that has a 90% false positive rate. That is not optimized for the customer.
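
To make the arithmetic concrete, here is a minimal sketch with made-up numbers (a hypothetical 200-line PR with 2 real issues, scored per line). It shows that flagging everything maxes out line-level recall while precision collapses, and that 7 of 8 flags being fine corresponds to 12.5% precision:

```python
def precision_recall(flagged: set[int], real: set[int]) -> tuple[float, float]:
    """Line-level precision and recall for a set of flagged line numbers."""
    true_positives = len(flagged & real)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(real) if real else 0.0
    return precision, recall


# Hypothetical PR: 200 lines, real issues on lines 17 and 142.
all_lines = set(range(1, 201))
real_issue_lines = {17, 142}

# Flag every line: perfect recall, 1% precision.
p_all, r_all = precision_recall(all_lines, real_issue_lines)
print(p_all, r_all)  # 0.01 1.0

# Eight flags, only one of them real: 7/8 are false positives.
p_eight, _ = precision_recall(set(range(1, 9)), {3})
print(p_eight)  # 0.125
```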

onion2k 15 days ago

The ideal is finding all the problems without getting any false positives, but the reality is that you often can't have that. An org's engineering culture should be designed to fix problems with systems. If you're seeing an 87.5% false positive rate, that should be seen as another engineering problem to fix. However, that's a separate issue from whether or not you accept false positives in a system designed to find problems.

Presenting it as either a system that misses real problems or a system that has a huge number of false positives is a false dilemma. You can have a system that's designed to find all the problems and then optimize it to reduce the false positives. If you can't reduce the number, then you optimize to identify false positives as fast as possible. Just ignoring the identified problems on the assumption that they're false is a giant red flag and a signal that the org has a very broken engineering culture (but, as you say, that's quite common).

eranation 15 days ago

Yep. Similarly, you can predict with 99.9% accuracy whether a volcano will erupt today by using a rock that has "No" written on it.
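
A toy version of the rock, assuming one eruption in 1,000 days (a made-up base rate, chosen only to reproduce the 99.9% figure):

```python
days = 1000
erupted = [False] * days
erupted[500] = True  # hypothetical: exactly one eruption in the period

# The rock always answers "No".
predictions = [False] * days

correct = sum(p == a for p, a in zip(predictions, erupted))
accuracy = correct / days

caught = sum(p and a for p, a in zip(predictions, erupted))
recall = caught / sum(erupted)

print(accuracy, recall)  # 0.999 0.0
```

Accuracy looks excellent only because the event is rare; the one eruption that mattered is missed.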

williamdclt 15 days ago

> If I flag every line in your PR as a potential security bug then I have 100% recall.

No. A code review isn't about "flagging a line of code", it's about identifying an issue or a risk. If a 10-line PR has one issue and you leave a comment on every single character but still miss the issue, you have 0% recall.
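
A rough sketch of what issue-level scoring looks like, as opposed to line-level scoring. Everything here is an assumption for illustration: the `matches` callback stands in for whatever judge (human or LLM) a benchmark actually uses, and the keyword matcher is a deliberately crude placeholder:

```python
from typing import Callable


def issue_recall(
    comments: list[str],
    golden_issues: list[str],
    matches: Callable[[str, str], bool],
) -> float:
    """Fraction of golden issues that at least one comment actually identifies."""
    if not golden_issues:
        return 0.0
    found = sum(any(matches(c, g) for c in comments) for g in golden_issues)
    return found / len(golden_issues)


def keyword_match(comment: str, issue: str) -> bool:
    # Placeholder judge: the comment must name the issue, not just sit on its line.
    return issue.lower() in comment.lower()


golden = ["off-by-one"]

# A generic comment on every one of 10 lines still identifies nothing.
noise = [f"line {n}: possible security bug" for n in range(1, 11)]
print(issue_recall(noise, golden, keyword_match))  # 0.0

# One comment that names the actual problem is what moves recall.
useful = noise + ["line 4: loop bound has an off-by-one error"]
print(issue_recall(useful, golden, keyword_match))  # 1.0
```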

tirpen 15 days ago

Which LLM did you use? I assume that will make a pretty big difference.

eranation 15 days ago

gpt-5-mini and gpt-5.5 (had to tweak the code a bit to make it work)

Surprisingly not as big of a difference as one would hope. It turns out that smarter models are more conservative. Smarter model / More thinking = slightly worse recall sometimes.

I think it says more about the benchmark itself perhaps. Reviews are highly opinionated. And it could be that the smarter models are actually better, just the “golden” state is very opinionated.

jayphen 13 days ago

That must have been expensive! Thanks for running the benchmark and sharing.

I tested Coducky (my AI review macOS app) on the full 50-PR Martian benchmark using qwen3.7-plus via OpenRouter as the reviewer, with a lightweight pre-save precision gate using deepseek-v4-flash. The score (gpt 5.2 judge) was 43.0% precision / 35.8% recall / 39.0 F1. That puts it about in line with CodeRabbit. It cost around $7 to run the full 50 PRs.

Your post inspired me to set up a test harness for my app to continue to test model combinations. Coducky allows you to select whichever models/subscriptions you like to run reviews, but it could make sense to build a collection of model combinations that work well for this.

bobkb 15 days ago

False positives from the deterministic audits are a very difficult problem to address. Comparing and deduplicating across different methods or LLM audits seems to be the only way.
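
One possible reading of "comparing and deduplicating across methods", sketched under loud assumptions: the `Finding` shape, the source names, and the rule "keep only findings reported by at least two independent sources" are all hypothetical, not something described in this thread.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    source: str    # e.g. a linter or a specific LLM reviewer (hypothetical names)
    path: str
    line: int
    category: str  # normalized issue type, e.g. "sql-injection"


def corroborated(
    findings: list[Finding], min_sources: int = 2, line_window: int = 3
) -> list[Finding]:
    """Deduplicate findings and keep those reported by >= min_sources sources.

    Findings count as the same issue if they share a path, a category, and a
    coarse line bucket. Bucketing is crude: neighbours that straddle a bucket
    boundary are not merged.
    """
    sources_by_key: dict[tuple[str, str, int], set[str]] = defaultdict(set)
    representative: dict[tuple[str, str, int], Finding] = {}
    for f in findings:
        key = (f.path, f.category, f.line // line_window)
        sources_by_key[key].add(f.source)
        representative.setdefault(key, f)  # keep the first report as the canonical one
    return [
        representative[key]
        for key, sources in sources_by_key.items()
        if len(sources) >= min_sources
    ]


findings = [
    Finding("linter", "app.py", 10, "sql-injection"),
    Finding("llm-a", "app.py", 11, "sql-injection"),  # same bucket as the linter hit
    Finding("llm-b", "util.py", 40, "unused-var"),    # reported by one source only
]
kept = corroborated(findings)
print([(f.path, f.category) for f in kept])  # [('app.py', 'sql-injection')]
```

Requiring agreement trades recall for precision: an issue that only one method can see is dropped, so the threshold is itself something to tune.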
