by daemonologist 441 days ago
To some extent the "mystery" (and temporary free-as-in-beer-ness) of this model might be getting to me, but I think it's pretty interesting. Given the token throughput (250B this week) it's obvious there's a pretty major player behind the model, but why is it stealthed? Maybe there's something about the architecture or training that would put people off if it was public right off the bat? Maybe they're purely collecting usage/acceptance data and want unbiased users?

On the Aider Polyglot leaderboard it's ~middle of the leading pack, comparable to DeepSeek V3 and 3.5 Sonnet. I ran NoLi(teral)Ma(tching), an unsaturated long-context benchmark, on it and was impressed though:

  = Model =========== Base Score = 8K Context = 16K Context =
  Quasar Alpha:       >=97.8%      89.2%        85.1%
  GPT-4o:             99.3%        89.2%        81.6%
  Llama 3.3 70B:      97.3%        72.1%        59.5%
  Gemini 1.5 Pro:     92.6%        63.9%        55.5%
  Claude 3.5 Sonnet:  87.6%        61.7%        45.7%
  Gemini 1.5 Flash:   84.7%        44.4%        35.5%
  GPT-4o mini:        84.9%        32.6%        20.6%
  Llama 3.1 8B:       76.7%        31.9%        22.6%

It also performs well - slightly better than GPT-o1 - on the "hard" subset at 16K context with 62.8%. Latency is quite good as well.
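
One way to read the table above is to ask how much of its base score each model keeps at 16K context. The snippet below is only a rough sketch of that arithmetic using the numbers from the table. The `retention` ratio and the function names are illustrative assumptions, not a metric defined in this comment. Quasar Alpha's base score is given as ">=97.8%", so 97.8 is treated as a lower bound and its ratio is therefore an upper bound.

```python
# Scores (percent) copied from the table above: (base, 8K, 16K).
# Assumption: Quasar Alpha's base is reported as ">=97.8%", so 97.8 is a
# lower bound and its retention figure below is an upper bound.
SCORES = {
    "Quasar Alpha": (97.8, 89.2, 85.1),
    "GPT-4o": (99.3, 89.2, 81.6),
    "Llama 3.3 70B": (97.3, 72.1, 59.5),
    "Gemini 1.5 Pro": (92.6, 63.9, 55.5),
    "Claude 3.5 Sonnet": (87.6, 61.7, 45.7),
    "Gemini 1.5 Flash": (84.7, 44.4, 35.5),
    "GPT-4o mini": (84.9, 32.6, 20.6),
    "Llama 3.1 8B": (76.7, 31.9, 22.6),
}


def retention(base, score):
    """Fraction of the base score that survives at a longer context."""
    return score / base


def rank_by_retention_16k(scores):
    """Return (model, retention at 16K) pairs, best retention first."""
    rows = [
        (name, retention(base, s16))
        for name, (base, _s8, s16) in scores.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for name, ratio in rank_by_retention_16k(SCORES):
        print(f"{name:<20}{ratio:6.1%}")
```

Normalizing by the base score separates "weak at the task overall" from "degrades with context length", which the raw columns mix together.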

More details: https://old.reddit.com/r/LocalLLaMA/comments/1ju1czn/quasar_...


What is the reason you included Claude 3.5 instead of 3.7 in this?

I only ran the benchmark on Quasar Alpha*; the rest of the scores come from the original paper [0] which was published before 3.7 was available. This is a pretty expensive benchmark to run if you're paying for API usage - I'd actually originally set out to run it on Llama 4 but abandoned that after estimating the cost.

* - I also reproduced the Llama 3.1 8B result to check my setup.

[0] - https://arxiv.org/abs/2502.05167 / https://github.com/adobe-research/NoLiMa