It actually rendered an SVG inline in the HTML page. I just tested the SVG and it renders itself just fine, including colors. So, tbh, I'd say the task has been properly achieved.
Just curious, can you share what are those hardest puzzles that even the top models can't crack? sometimes when I find the puzzle absolutely undecipherable I like to ask LLMs to solve it, and I haven't seen them fail yet.
Are you next going to say YouTube rankings don't take into account videos that aren't on YouTube and Spotify rankings don't take into account songs that aren't on Spotify?
OpenRouter rankings frustrate me, because they show the total number of tokens but they provide no indication of how many unique users a model has.
Which means if a surprise model tops the leaderboard one week we can never be sure if it was because a single whale user pushing billions of tokens a day switched to it, or if it represents a genuine community trend towards that model.
Yeah we should do something to indicate cardinality. I can share that there can often (I'm talking generally; not related to this model in particular) be e.g. a very large app that can be pushing a lot of volume. But in almost all cases that app has a large number of end users. Hypothetically, for instance, would Cursor be consider one user, or millions?
I'd consider Cursor one user because it's one entity that made an editorial decision about which model to make available to their own community.
If you treated Cursor as millions of users it might look like millions of people independently chose a new model when actually it was Cursor making the choice for them - and the thing I care most about is how many choices were made that selected a model and put it above the others.
An alternative viewpoint is that the single choice made about switching the Cursor model was done after extensive testing by a competent and experienced team. Whereas my naive self choosing a model to play with this week is far less a signal to others that the model is fit for purpose.
One idea I had was to count # of distinct API keys that have spent atleast $100 (number's flexible), which would be enough to provide guidance on if the traffic is from a single power-user.
In the Cursor case which is BYOK, that would count as distinct API keys.
Hi! Big fan of OpenRouter and the data you provide. It'd be awesome if you would consider providing volume of tokens per hour, mostly for my own curiosity as to quite how peaky demand is.
Also, while we're pitching new features to openrouter, I'd like to see a "$ spent" chart, which would remove all these huge freebie spikes. It looks like it would be pretty much dominated by claude.
We were talking about whether these metrics are meaningful. I was just pointing out that even a tiny one-person company can burn a lot of tokens.
As to whether the token spend is questionable, the number I quoted is for my production AI pipelines, not for coding. And my customers (and profit margin) seem to think the spending is valuable.
Questionable spending aside, GGP is providing information about how a specific metric may not measure what people think it measures. There is value in that comment.
So basically, Hy3 is the cheapest decent model on OpenRouter, unless you use DeepSeek as the provider for DeepSeek V4 Flash, in which case DeepSeek's insane caching wins out. (And Hy3 is close-ish on the benchmarks.)
You need to use DeepSeek API directly to gain the extra caching benefits. The DeepSeek provider on OpenRouter is only the 5th-cheapest for V4 Flash, so you have to specify DeepSeek provider when calling OpenRouter. But DeepSeek's API discounts on its models only applies if you call DeepSeek directly. So anyone using OpenRouter to call DeepSeek models is actually losing quite a bit of money.
> The DeepSeek provider on OpenRouter is only the 5th-cheapest for V4 Flash
You might have the default settings on your account, which limit Deepseek as a provider. If you disable that feature you see them on openrouter as well (and they serve it at the same cost as their own API).
However, I just double checked, and OpenRouter's pricing page for Flash v4 with DeepSeek provider shows a cache hit rate of $0.0028, which is the same as on DeepSeek's official API pricing page ($0.0028), so they do seem to be the same price, (assuming DeepSeek is able to pin your specific OpenRouter requests to the same DeepSeek server). OpenRouter adds 5% to that cost, but still it might be cheaper than the other providers.
Also just found out OpenRouter has a new feature "Response Caching" where they can cache identical requests and return them immediately with no billing. The entire request must be identical, though, not just a prefix, and you have to enable this feature. I don't know who would need to send multiple identical requests, but it's better than nothing?
Interesting, it seems we have some providers offering dsv4-flash cheaper than ds themselves. For the full model it's the other way around, all 3rd party providers are 2x+ more expensive.
The cheaper ones are fp4 and fp8 whereas I assume DeepSeek provider is unquantized, so that probably accounts for it. DeepSeek also doesn't necessarily have the cheapest hardware, other providers could be using it as a loss leader, etc
> it makes sense that a cheaper model would prevail, but only if it offered similar quality
You're trying to think logically, which has no place in an AI discussion. :) People just jump to whatever the latest model is. Plenty of people also prefer price to "quality" (which is very subjective). It's new, it's cheap, so people use it. It's likely people will stop using it when something else is cheaper and/or newer.
Can you share more? I'm with OpenRouter and we would love to address this! We don't see this in our own testing, I don't believe -- but will share this feedback and dig in.
Just try. In a case last week it was ~3x and I tried multiple providers: deepseek, gmicloud/fp8, novita/fp8, and another one I can't remember. It was a large job where at least 2/3rds of the start of the prompts was exactly the same (literally a static string).
Then I read somewhere (I think X) that OpenRouter adds stuff and breaks caching (telemetry? headers? can't remember). So I stopped the job, switched to actual DeepSeek provider, and voilá, caching 3x more tokens per request (on average).
Since there’s only one inference provider it could be a recycling/ad experiment. The similar usage between trial and paid periods would be explained by this as well.
Tried this extensively in OpenCode, never used it once since Gemma 4 came out, got into thought loops and did stupid edits I didn't ask for more often than the local 31b model. One of the worst "frontier" models I've ever tried.
This article got me messing with it, and I'm loving it as a post-training target.
Training on ~1B tokens on 8xB300 and the first checkpoint halfway in learned really well. Tencent might be struggling with agentic work, but the base knowledge is there.
For the life of me I will never understand the thought process that leads you to say "we don't really know who developed this LLM but I'm going to feed all of my business's data to it"
OpenAI & Anthropic are deeply in bed with US govt, and they need US govt approval before model releases, and all US Companies under various acts need to share data with the govt.
I mean sure there are investors and a little more open-ness, but with the example of Mythos we don't even know if public will get access to the "good" stuff because it's too dangerous.
If your only opinion on trusting these companies more than one based in China is, they are Chinese then good luck, all the best.
The difference is "the various acts" in the US are things that are largely very hard to do, extremely limited in scope, and companies who dispute the government's propriety can (and do) go to court to fight it.
Sure "China bad, US good" is naive, but certainly not more naive than suggesting that companies and individuals have similar rights and protections as each other.
> and they need US govt approval before model releases
This is just not true and it would be a gigantic legal battle to make it true against the model companies' wishes, which is indicative of your entire misunderstanding here.
There was recently some announcement from the US govt itself (after the Mythos announcement) that they were pondering about allowing model releases from now on only after approving them.
So it may not be strictly true for the moment, but it is certainly something that the current US govt can mandate at any time.
You don't need to know who developed the LLM - whether it was Google or OpenAI.
What you need to know is who is the provider for the LLM, and whether their endpoints are zero data retention enabled and opted out of training. OpenRouter gives you an easy way to control this.
This is not entirely true and ignoring a couple of potential attack vectors like Data Poisoning: https://arxiv.org/abs/2408.12798
Its of course highly dependant on the use case and the environment, but simply saying that the only important part is to know where the data goes is too simple.
OpenRouter and the provider sign a contract clearly specifying how input data is to be handled.
It's the same way we trust OpenAI to not train on our data if we've opted out although there is no control on whether they can retain the data indefinitely.
I really dont want to be cynic but those guys gave a flying f””” about copyright while scraping the whole internet. How can I ever trust them to respect the oot-out setting. I cant. Thieves be thieves.
And even if they dont train on the data. Who guarantees us, they dont let another AI model analyse all the data, exfiltrating all kinds of intelligence and using it? I only can imagine what OpenAI and Anthropic know….
Scraping the internet isn't a copyright violation. Using it for LLM training is much more transformative than Google and Internet Archive, which are legal.
(Transcript: https://gist.github.com/simonw/c2a0d8ecd3056a2681319eae8fc3f...)