HN Mirror


	by theturtletalks 479 days ago
	I personally use Aider's Polyglot Benchmark [0], which is a bit low-key and not gamed just yet. It also matches my experience, where Claude Sonnet 3.5 is the best and still beats the new reasoning models like o3-mini, DeepSeek, etc.

	[0] https://aider.chat/docs/leaderboards/

4 comments

KaoruAoiShiho 479 days ago

Sonnet is literally lower on the Aider benchmark you just linked. It's only at the top with DeepSeek as architect; otherwise it's lower than many others.

refulgentis 479 days ago

Let's steelman a bit: once you multiply out the edit accuracy versus completion accuracy, Sonnet, on its own, is within 5% of the very top entry not using Sonnet.
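
A rough sketch of the "multiply out" arithmetic, assuming it means the product of the leaderboard's completion percentage and its correct-edit-format percentage. The function name, the model names, and every number below are hypothetical placeholders, not leaderboard values.

```python
def combined_score(percent_correct: float, percent_well_formed: float) -> float:
    """Multiply two percentages (each 0-100) and return the product as a percentage."""
    return percent_correct * percent_well_formed / 100.0


# Hypothetical placeholder entries: (completion %, correct edit format %).
# These are NOT real leaderboard numbers.
entries = {
    "model_a": (60.0, 90.0),
    "model_b": (50.0, 100.0),
}

scores = {name: combined_score(c, f) for name, (c, f) in entries.items()}
best = max(scores.values())

# Print each entry with its gap to the best combined score, highest first.
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.1f}% combined, {best - score:.1f} points behind the top")
```

Plugging in the actual leaderboard columns would show how far apart two entries end up once edit-format failures are counted against them.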

theturtletalks 479 days ago

Yes, but I use Cursor Composer Agent mode with Sonnet, which is like Aider's architect mode, where one LLM instructs another. Not to mention that the new reasoning models can't use tool calling (except o3-mini, which is not multi-modal).

KaoruAoiShiho 479 days ago

Me too, Cursor+Sonnet is also my go-to. I just didn't really understand what you were getting at by pointing out this benchmark. I guess it is significant that Sonnet is the actual line-by-line coder here. It is the best at that, and it's better than DeepSeek+any other combination and better than any other reasoner+Sonnet.

theturtletalks 479 days ago

Yes, I've followed this benchmark for a while, and before DeepSeek + Sonnet Architect took the top spot, Sonnet was there alone, followed by o1 and Gemini EXP. This is one of the few benchmarks where Sonnet is actually on top, like my experience shows. Other popular ones have o3-mini and DeepSeek R1 on top, which fall short in my opinion.

nyrikki 479 days ago

There is quite a corpus of Exercism tasks that was almost certainly trained on, which could reduce this benchmark to what we know LLMs/LRMs are good at: approximate retrieval.

https://github.com/search?q=Exercism&type=repositories

yunwal 479 days ago

Are Exercism coding exercises really low-key? I thought it was like the standard free platform for learning a new language now.

theturtletalks 479 days ago

Low-key as in not many people check this leaderboard as often as the other high-profile ones.

azinman2 479 days ago

Would love it if they included latency in this too.