Hacker News
by vunderba 8 days ago
I use LLMs more for peer review and came to a similar conclusion: gpt-5.5 codex with xhigh reasoning seemed to catch more edge cases and went "deeper" in its analysis than Opus 4.7/4.8.

My preliminary tests of Fable were pretty promising but that's DOA for everyone for now.

1 comment

Claude often spent most of its output listing all the things that were already correct and working ("This is good"), and most of its findings were false positives or outright wrong, as in the screenshot I posted above.