To help keep track of the race, I put together a simple dashboard to visualize model/provider leaders in capability, throughput, and cost. Hope someone finds it useful!
Familiar! The Artificial Analysis Index is the metric models are sorted by in my sheet. But their data and presentation have some gaps.
I made this sheet to get a glanceable landscape view comparing the three key dimensions I care about, and to fill in the missing evals. AA only lists scores for a few increasingly dated and problematic benchmarks. That's not just my opinion: none of their listed metrics are in HuggingFace Leaderboard 2 (June 2024).
That said, I love the AA Index score because it provides a single normalized score that blends vibe-check qual (chatbot Elo) with widely reported quant (MMLU, MT-Bench). I wish it incorporated more contemporary evals, but I don't have the rigor/attention to make my own score and am not aware of a better substitute.
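For anyone curious what "a single normalized score" can look like in practice, here is a minimal sketch. To be clear, this is not AA's actual formula, which I don't know. It assumes min-max normalization per eval and equal weights, and the model names and numbers are made-up placeholders, not real results.

```python
from typing import Dict, Optional


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one eval's raw scores to [0, 1] across the models compared."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # Every model tied on this eval, so it carries no ranking signal.
        return {model: 0.0 for model in scores}
    return {model: (s - lo) / (hi - lo) for model, s in scores.items()}


def blended_index(
    evals: Dict[str, Dict[str, float]],
    weights: Optional[Dict[str, float]] = None,
) -> Dict[str, float]:
    """Weighted mean of per-eval normalized scores, scaled to 0-100.

    Assumes every eval reports a score for the same set of models.
    """
    if weights is None:
        weights = {name: 1.0 for name in evals}  # assumption: equal weights
    total = sum(weights[name] for name in evals)
    normalized = {name: min_max_normalize(s) for name, s in evals.items()}
    models = next(iter(evals.values())).keys()
    return {
        m: 100 * sum(weights[n] * normalized[n][m] for n in evals) / total
        for m in models
    }


# Hypothetical placeholder data: NOT real benchmark results.
EVALS = {
    "chatbot_elo": {"model_a": 1250.0, "model_b": 1150.0, "model_c": 1050.0},
    "mmlu": {"model_a": 0.85, "model_b": 0.80, "model_c": 0.65},
    "mt_bench": {"model_a": 9.0, "model_b": 8.0, "model_c": 7.0},
}

index = blended_index(EVALS)
for model, score in sorted(index.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {score:.1f}")
```

Normalizing first is what lets an Elo rating in the thousands sit next to a 0-1 accuracy and a 1-10 judge score. The catch with min-max is that the result depends on which models are in the comparison set, so adding a new leader shifts everyone's score.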