Hacker News
by sberens 322 days ago
Interesting that there doesn't seem to be any benchmarking on Codeforces.
1 comment

I'm a Codeforces guy, and I've benchmarked o3 on several of my favorite problems of varying difficulty and concluded that o3 still isn't suitable for true reasoning. Mostly that's because it's unable to think from first principles, so if you throw a non-standard problem at it, it will brick. I think this will be a fundamental issue with any LLM.

I will say I would much prefer an AI that, when it faces these ambiguous problems, either provides sources for further reading or just admits it doesn't know, and that is, you know, actually trying to work with you to find a solution instead of being trained to one-shot everything.

When generalizing these skills to, say, debugging, I will often just straight up ignore the AI slop conclusion it reached and instead explore the sources it found. o3 is surprisingly good at finding those. But for hard niche debugging, the conclusions it comes to are not only wrong but also phrased arrogantly, and when you push back it's actually like talking to a narcissist (phrasing objections as "you feel", being excessively stubborn, word-dumping a bunch of phrases that sound correct but don't hold up to scrutiny, etc.).