by jug
667 days ago
Yes, and I also noted that it beats Claude 3.5 Sonnet in Chatbot Arena by a bit of a margin. This further feeds my concern that as AI models get more advanced, the random enthusiasts at that site may no longer be able to rank them well, and that tuning for Chatbot Arena might be a thing, one that GPT-4o also exploits. GPT-4o absolutely does not rank wildly ahead of Claude 3.5 Sonnet across a wide variety of benchmarks, yet it does in Chatbot Arena... People actually using Claude 3.5 Sonnet are also quite satisfied with its performance, often ranking it as more helpful than GPT-4o for solving engineering problems, though at the cost of tighter usage limits.
Chatbot Arena was great when the models were still fairly stupid, but these days, remember that everyday people are faced with the task of ranking premium LLMs that can even solve some logic puzzles and trick questions and have deep general knowledge far beyond that of any single human. Raters can target traditional weaknesses like math, but then all of the models suffer. So it's not an easy task at all, and I'm not sure the site is very reliable anymore other than for smaller models.
|
You can review for yourself and decide whether it was justified (you can compare based on win/loss/tie responses and matchups). Generally, Claude still has more refusals (easy wins for the model that actually answers the request), often has worse formatting (it's arguable whether the competitors' formatting is actually better, but people like it more), and is less verbose (personally, I'd prefer the right answer with fewer words, but Chatbot Arena users generally disagree).
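To make the W/L/T comparison concrete, here's a rough sketch of how refusals alone can move a head-to-head win rate. Everything in it is an assumption for illustration: the `Matchup` class, the `without_refusals` helper, the choice to count a tie as half a win, and the counts themselves, which are made up and not real Chatbot Arena data.

```python
from dataclasses import dataclass


@dataclass
class Matchup:
    """Hypothetical head-to-head record for model A against model B."""
    wins: int
    losses: int
    ties: int

    @property
    def total(self) -> int:
        return self.wins + self.losses + self.ties

    def win_rate(self) -> float:
        """Share of battles A wins, counting each tie as half a win (an assumption)."""
        if self.total == 0:
            raise ValueError("no battles recorded")
        return (self.wins + 0.5 * self.ties) / self.total


def without_refusals(record: Matchup, refusal_losses: int) -> Matchup:
    """Return a copy of the record with losses caused by A's refusals dropped."""
    if not 0 <= refusal_losses <= record.losses:
        raise ValueError("refusal_losses must be between 0 and record.losses")
    return Matchup(record.wins, record.losses - refusal_losses, record.ties)


# Made-up counts for illustration only, not real Chatbot Arena data.
record = Matchup(wins=40, losses=50, ties=10)
print(f"raw win rate: {record.win_rate():.3f}")
print(f"ignoring 10 refusal losses: {without_refusals(record, 10).win_rate():.3f}")
```

With these invented numbers, the model goes from losing the matchup to breaking even once refusal losses are set aside, which is the kind of effect worth checking for when you look at the actual battles.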
If you look at the questions (and the Chatbot Arena and WildChat analyses), most people aren't using LLMs for math, reasoning, or even coding. If anything, arena usage is probably overly skewed toward reasoning/trick questions because of the subset of people poking at the models.
Of course, different people value different things. I've almost exclusively been using 3.5 Sonnet since it came out because it's been the best code assistant and Artifacts are great, only falling back to GPT-4o for occasional Code Interpreter work (for tricky problems, Mistral's Codestral actually seems to be a good fallback, often being able to debug issues that neither of those models can, despite being a tiny model in comparison).