by theturtletalks 479 days ago
I personally use Aider's Polyglot Benchmark [0], which is a bit low-key and not gamed just yet. It also matches my experience: Claude Sonnet 3.5 is the best and still beats the new reasoning models like o3-mini, DeepSeek, etc.

0. https://aider.chat/docs/leaderboards/

4 comments

Sonnet is literally lower on the Aider benchmark you just linked. It's only at the top with DeepSeek as architect; on its own it's lower than many others.
Let's steelman a bit: once you multiply out the edit accuracy versus completion accuracy, Sonnet, on its own, is within 5% of the top entry that doesn't use Sonnet.
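A rough sketch of the "multiply out" arithmetic, assuming it means multiplying a model's completion (pass) rate by its edit-format accuracy, and assuming "within 5%" is a relative gap. The function names and the numbers are placeholders for illustration, not values from the leaderboard:

```python
def effective_score(pass_rate: float, edit_format_rate: float) -> float:
    """Multiply the two rates together (both given as fractions in [0, 1])."""
    return pass_rate * edit_format_rate


def relative_gap(score: float, best: float) -> float:
    """How far `score` falls short of `best`, as a fraction of `best`."""
    return (best - score) / best


# Placeholder numbers only, NOT taken from the leaderboard.
solo = effective_score(pass_rate=0.50, edit_format_rate=1.00)
top = effective_score(pass_rate=0.60, edit_format_rate=0.85)
gap = relative_gap(solo, top)

print(f"solo={solo:.3f} top={top:.3f} relative gap={gap:.1%}")
```

To check the claim, swap in the two columns from the leaderboard for the entries being compared.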
Yes, but I use Cursor Composer Agent mode with Sonnet, which is like Aider's architect mode, where one LLM instructs another. Not to mention the new reasoning models can't use tool calling (except o3-mini, which is not multi-modal).
Me too, Cursor+Sonnet is also my go-to. I just didn't really understand what you were getting at by pointing out this benchmark. I guess it is significant that Sonnet is the actual line-by-line coder here. It is the best at that: it's better than DeepSeek+any other combination and better than any other reasoner+Sonnet.
Yes, I've followed this benchmark for a while, and before DeepSeek + Sonnet Architect took the top spot, Sonnet was there alone, followed by o1 and Gemini EXP. This is one of the few benchmarks where Sonnet is actually on top, like my experience shows. Other popular ones have o3-mini and DeepSeek R1 on top, which fall short in my opinion.
There's quite a corpus of Exercism tasks out there that were almost certainly trained on, which could reduce this benchmark to what we know LLMs/LRMs are good at: approximate retrieval.

https://github.com/search?q=Exercism&type=repositories

Are Exercism coding exercises really low-key? I thought it was like the standard free platform for learning a new language now.
Low-key as in not many people check this leaderboard compared to the other high-profile ones.
Would love it if they included latency in this too.