| > I am struggling to connect the relevance of this > focusing on a single specific capability and > I am not really invested in this niche topic Right: I definitely ceded a "but it doesn't matter to me!" argument in my comment. I sense a little "doth protest too much", in the multiple paragraphs devoted to taking that and extending it to the underpinning of automation is "irrelevant" "single" "specific", "niche". This would also be news to DeepSeek, who put a lot of work to launch it in the r1 update a couple weeks back. Separately, I assure you, it would be news to anyone on the Gemini team that they don't care because they want to own everything. I passed this along via DM and got "I wish :)" in return - there's been a fire drill trying to improve it via AIDER in the short term, is my understanding. If we ignore that, and posit there is an upper management conspiracy to suppress performance, its just getting public cover by a lower upper management rush to improve scores...I guess that's possible. Finally, one of my favorite quotes is "when faced with a contradiction, first check your premises" - to your Q about why no one can compete with DeepSeek R1 25-01, I'd humbly suggest you may be undergeneralizing, given even tool calls are "irrelevant" and "niche" to you. |
In the top 10, are models by OpenAI (gpt4omini), Google (gemini flashes and pros), Anthropic (Sonnets) and Deepseeks'. Even though the company list grows shorter if we instead look at top model usage grouped by order of magnitude, it retains the same companies.
Personally, the models meeting my quality bar are: gpt 4.1, o4-mini, o3, gpt2.5pro, gemini2.5flash (not 2.0), claude sonnet, deepseek and deepseek r1 (both versions). Claude Sonnet 3.5 was the first time I found LLMs to be useful for programming work. This is not to say there are no good models by others (such as Alibaba, Meta, Mistral, Cohere, THUDM, LG, perhaps Microsoft), particularly in compute constrained scenarios, just that only Deepseek reaches the Quality tier of the big 3.