by JamesBarney 366 days ago

> With that said, I am seeing Claude Opus 4 (w/extended thinking) be distinctly worse at missing problems which o3-pro and Gemini find. It seems fairly consistent that Opus will be the worst out of the three (despite sometimes noticing things the others do not).

I've found the same thing: Claude is more likely to miss a bug than o3 or Gemini, but also more likely to catch something that o3 and Gemini both missed. If I had to pick one model I'd pick o3 or Gemini, but if I had to pick a second model I'd pick Opus.

It also seems to have a much higher false positive rate, whereas Gemini seems to have the lowest false positive rate.

Basically, o3 and Gemini are better individually, but their misses are also more correlated with each other, which is what gives Opus a lot of value as a second model.
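
To make the correlation point concrete, here's a toy sketch. The bug sets are entirely made up for illustration (they are not measurements of any real model), and `recall` and `pair_recall` are just hypothetical helper names. It shows how a weaker reviewer can still be the better second pick when its catches overlap less with the first reviewer's.

```python
# Toy illustration with HYPOTHETICAL data: 10 bugs, and the made-up set of
# bug IDs each reviewer catches. None of this is measured from real models.
ALL_BUGS = set(range(10))

caught = {
    "o3": {0, 1, 2, 3, 4, 5, 6},
    "gemini": {0, 1, 2, 3, 4, 5, 7},  # overlaps heavily with o3
    "opus": {0, 1, 2, 8, 9},          # catches fewer, but different ones
}


def recall(found):
    """Fraction of all bugs contained in `found`."""
    return len(found & ALL_BUGS) / len(ALL_BUGS)


def pair_recall(a, b):
    """Recall when a bug counts as caught if either reviewer flags it."""
    return recall(caught[a] | caught[b])


for name, found in caught.items():
    print(f"{name}: {recall(found):.1f}")

for a, b in [("o3", "gemini"), ("o3", "opus"), ("gemini", "opus")]:
    print(f"{a} + {b}: {pair_recall(a, b):.1f}")
```

With these invented numbers, Opus alone has the lowest recall (0.5 vs 0.7), but pairing it with either of the others gets to 0.9, while o3 + Gemini only gets to 0.8 because they mostly find the same bugs. This ignores false positives, which would push the other way.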