I ran an interesting benchmark/experiment yesterday, which did not do Quasar Alpha any favors (from best to worst, score is an average of four runs):
"google/gemini-2.5-pro-preview-03-25" => 67.65
"anthropic/claude-3.7-sonnet:thinking" => 66.76
"anthropic/claude-3.7-sonnet" => 66.23
"deepseek/deepseek-r1:free" => 54.38
"google/gemini-2.0-flash-001" => 52.03
"openai/o3-mini" => 47.82
"qwen/qwen2.5-32b-instruct" => 44.78
"meta-llama/llama-4-maverick:free" => 42.87
"openrouter/quasar-alpha" => 40.27
"openai/chatgpt-4o-latest" => 37.94
"meta-llama/llama-3.3-70b-instruct:free" => 34.40
The benchmark is a bit specific, but challenging. It's a prompt optimization task where the model iteratively writes a prompt, the prompt gets evaluated and scored from 0 to 100, and then the model can try again given the feedback. The whole process occurs in one conversation with the model, so it sees its previous attempts and their scores. In other words, it has to do reinforcement learning on the fly.

Quasar did barely better than 4o. I was also surprised to see the thinking variant of Sonnet not provide any benefit. Both Gemini and ChatGPT benefit from their thinking modes. Normal Sonnet 3.7 does do a lot of thinking in its responses by default though, even without explicit prompting, which seems to help it a lot.

Quasar was also very unreliable and frequently did not follow instructions. I had the whole process automated, and the automation would retry a request if the response was incorrect. Quasar took on average 4 retries of the first round before it caught on to what it was supposed to be doing. None of the other models had that difficulty, and almost all other retries were the result of a model re-using an existing prompt.

Based on looking at the logs, I'd say only o3-mini and the models above it were genuinely optimizing. By that I mean they continued to try new things, tweaked the prompts in subtle ways to see what would happen, and consistently introspected on the patterns they were observing. That enabled all of those models to continuously find better and better prompts. In a separate manual run I let Gemini 2.5 Pro go for longer and it was eventually able to get a prompt to a score of 100.

EDIT: But yes, to the article's point, Quasar was the fastest of all the models, hands down. That does have value on its own.
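
For anyone curious what the loop looks like, here is a rough sketch of the setup described above: propose a prompt, score it, feed the history back, and retry when a response is unusable or repeats an earlier prompt. This is not the actual harness. `optimize_prompt`, `propose`, `evaluate`, and the toy stand-ins are hypothetical names made up for illustration, and the real scorer and the model calls are not shown.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, float]]  # (prompt, score) pairs the model gets to see


def optimize_prompt(
    propose: Callable[[History], str],   # hypothetical: wraps the model call
    evaluate: Callable[[str], float],    # hypothetical: returns a 0-100 score
    rounds: int = 5,
    max_retries: int = 10,
) -> Tuple[str, float]:
    """Run the propose/score/feedback loop and return the best (prompt, score)."""
    history: History = []
    for _ in range(rounds):
        for _ in range(max_retries):
            candidate = propose(history).strip()
            seen = {prompt for prompt, _ in history}
            # Retry when the response is empty or re-uses an existing prompt.
            if candidate and candidate not in seen:
                break
        else:
            raise RuntimeError("model kept returning empty or repeated prompts")
        history.append((candidate, evaluate(candidate)))
    return max(history, key=lambda item: item[1])


# Toy stand-ins (assumptions) so the sketch runs without any API access.
def toy_propose(history: History) -> str:
    if not history:
        return "Summarize"
    best_prompt, _ = max(history, key=lambda item: item[1])
    return best_prompt + " carefully"


def toy_evaluate(prompt: str) -> float:
    # Arbitrary scoring rule: 20 points per word, capped at 100.
    return min(100.0, 20.0 * len(prompt.split()))


if __name__ == "__main__":
    print(optimize_prompt(toy_propose, toy_evaluate))
```

In a real run, `propose` would send the whole conversation so far (previous prompts plus their scores) to the model, and the retry counter is where something like Quasar's first-round retries would show up.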