Grok 3 is the first model to surpass 1400 on the Chatbot Arena leaderboard

In general, the Chatbot Arena leaderboards aren't considered especially reliable benchmarks anymore, particularly since they allow model makers to effectively hill-climb on them by pre-releasing models. That being said, Grok 3 has also done quite well on many standard benchmarks. My personal vibe check, which consisted of some tricky math problems and various Kubernetes architecture questions, places it around DeepSeek V3. The thinking mode isn't generally available yet; I expect it to be significantly better, probably putting it around R1 and o1 in real-world usage.

It doesn't feel groundbreaking to me yet, since it doesn't feel consistently better than the rest of the frontier models, but it's definitely a frontier model. Congratulations to the xAI team for getting to the frontier so quickly.