by jawon 137 days ago

I was thinking about in-house model inference speeds at frontier labs like Anthropic and OpenAI after reading the "Claude built a C compiler" article.

Having higher inference speed would be an advantage, especially if you're trying to eat all the software and services.

Anthropic offering 2.5x makes me assume they have 5x or 10x themselves.

In the predicted nightmare future where everything happens via agents negotiating with agents, the side with the most compute, and the fastest compute, is going to steamroll everyone.

4 comments

Aurornis 137 days ago

> Anthropic offering 2.5x makes me assume they have 5x or 10x themselves.

They said the 2.5X offering is what they've been using internally. Now they're offering it via the API: https://x.com/claudeai/status/2020207322124132504

LLM APIs are tuned to handle a lot of parallel requests. In short, the overall token throughput is higher, but the individual requests are processed more slowly.

The scaling curves aren't that extreme, though. I doubt they could tune the knobs to get individual requests coming through at 10X the normal rate.

This likely comes from having some servers tuned for higher individual request throughput, at the expense of overall token throughput. It's possible that it's on some newer generation serving hardware, too.
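
A toy model of the tradeoff described above, in Python. It assumes each decode step costs a fixed amount (loading weights) plus a small per-sequence amount, so a step emits one token for every request in the batch. The constants `fixed_ms` and `per_seq_ms` are invented for illustration and are not measurements of any real serving stack.

```python
# Toy model: one decode step yields one token per request in the batch.
# fixed_ms and per_seq_ms are made-up constants, not real measurements.

def decode_step_ms(batch_size, fixed_ms=20.0, per_seq_ms=0.5):
    """Time for one decode step at a given batch size."""
    return fixed_ms + per_seq_ms * batch_size

def per_request_tps(batch_size, **kw):
    """Tokens per second seen by a single request."""
    return 1000.0 / decode_step_ms(batch_size, **kw)

def aggregate_tps(batch_size, **kw):
    """Tokens per second produced by the whole server."""
    return batch_size * per_request_tps(batch_size, **kw)

for batch in (1, 8, 64):
    print(batch, round(per_request_tps(batch), 1), round(aggregate_tps(batch), 1))
```

With these numbers, shrinking the batch from 64 to 8 roughly doubles per-request speed while cutting aggregate throughput to under a third. Even a batch of 1 is only about 2.5x faster per request than a batch of 64, because the fixed cost dominates, which is one way to see why 10X is hard to reach by tuning batch size alone.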

stavros 137 days ago

This makes no sense. It's not like they have a "slow it down" knob; they're probably parallelizing your request so you get a 2.5x speedup at 10x the price.

brookst 137 days ago

All of these systems use massive pools of GPUs, and allocate many requests to each node. The “slow it down” knob is to steer a request to nodes with more concurrent requests; “speed it up” is to route to less-loaded nodes.

stavros 137 days ago

Right, but that's still not Anthropic adding an intentional delay for the sole purpose of having you pay more to remove it.

yunohn 136 days ago

But it’s actually not so difficult, is it? The simplest way to make a slow pool is by having fewer GPUs and queuing requests for the non-premium users. Dead simple engineering.

stavros 136 days ago

No, the simplest way is `sleep(10)`.

yunohn 136 days ago

No, that wastes actual GPU resources and money. The method I described saves them money on the cheaper pool.

Regardless, I sense you’re being sarcastic and difficult, so I have no desire to discuss this further.

brookst 137 days ago

Oh, of course. That’s just conspiratorial thinking. Paying to be in a premium pool makes sense; all of this “they probably serve rotten food to make people pay for quality food” nonsense is just silly.

landl0rd 137 days ago

What they are probably doing is speculative decoding, given they've mentioned identical distribution at 2.5x speed. That's roughly in the range you'd expect for that to achieve; 10x is not.

It's also absolute highway robbery (or at least overly aggressive price discrimination) to charge 6x for speculative decoding, by the way. It is not that expensive and (under certain conditions, usually a very cheap drafter and a high acceptance rate) can actually decrease total cost. In any case, it's unlikely to be even a 2x cost increase, let alone 6x.
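
A back-of-the-envelope sketch of that speedup range, using the standard expected-speedup estimate for speculative decoding. It assumes each drafted token is accepted independently with probability `alpha`, the drafter proposes `gamma` tokens per round, and one drafter step costs a fraction `cost_ratio` of one target-model step. All three values below are assumptions chosen for illustration, not anything Anthropic has published.

```python
# Expected speedup from speculative decoding under an i.i.d. acceptance model.
# alpha, gamma and cost_ratio are illustrative assumptions.

def expected_tokens_per_round(alpha, gamma):
    """Expected tokens emitted per target-model verification pass."""
    if alpha == 1.0:
        return gamma + 1.0  # every drafted token accepted, plus one bonus token
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

def expected_speedup(alpha, gamma, cost_ratio):
    """Wall-clock speedup versus plain decoding with the target model."""
    round_cost = gamma * cost_ratio + 1.0  # gamma drafter steps + 1 target pass
    return expected_tokens_per_round(alpha, gamma) / round_cost

for alpha in (0.6, 0.8, 0.9):
    print(alpha, round(expected_speedup(alpha, gamma=4, cost_ratio=0.05), 2))
```

With an 80% acceptance rate, 4 drafted tokens per round, and a drafter at 5% of the target's cost, the estimate comes out near 2.8x. Under this model the speedup can never exceed `gamma + 1`, and a 10x figure would need both a long draft and a near-perfect acceptance rate.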

crowbahr 137 days ago

Where on earth are you getting these numbers? Why would a SaaS company that is fighting for market dominance withhold 10x performance if they had it? Where are you getting 2.5x?

This is such bizarre magical thinking, borderline conspiratorial.

There is no reason to believe any of the big AI players are serving anything less than the best trade off of stability and speed that they can possibly muster, especially when their cost ratios are so bad.

jawon 137 days ago

Not magical thinking, not conspiratorial, just hypothetical.

Just because you can't afford to 10x all your customers' inference doesn't mean you can't afford to 10x your in-house inference.

And 2.5x is from Anthropic's latest offering. But it costs you 6x normal API pricing.

jawon 137 days ago

Also, from a comment in another thread, from roon, who works at OpenAI:

> codex-5.2 is really amazing but using it from my personal and not work account over the weekend taught me some user empathy lol it’s a bit slow

[0] https://nitter.net/tszzl/status/2016338961040548123

falloutx 137 days ago

That's also called slowing down the default experience so users have to pay more for the fast mode. I think it's the first time we are seeing blatant speed ransoms in LLMs.

Aurornis 137 days ago

That's not how this works. LLM serving at scale processes multiple requests in parallel for efficiency. Reduce the parallelism and you can process individual requests faster, but the overall number of tokens processed is lower.

falloutx 137 days ago

They can now easily decrease the speed for the normal mode, and then users will have to pay more for fast mode.

Aurornis 137 days ago

Do you have any evidence that this is happening? Or is it just a hypothetical threat you're proposing?

These companies aren't operating in a vacuum. Most of their users could change providers quickly if they started degrading their service.

falloutx 137 days ago

They have contracts with companies, and those companies won't be able to change quickly. By the time those contracts come up for renewal it will already be too late, with their code having become completely unreadable by humans. Individual devs can move quickly but companies can't.

kolinko 137 days ago

Are you at all familiar with the architecture of systems like theirs?

The reason people don't jump to your conclusion here (and why you get downvoted) is that for anyone familiar with how this is orchestrated on the backend it's obvious that they don't need to do artificial slowdowns.

falloutx 137 days ago

I am familiar with the business model. This is a clear indication of what their future plan is.

Also, I just pointed out the business issue, raising a point that had not been raised here. I just want people to be more cautious.

blackqueeriroh 137 days ago

So you are not familiar with the system architecture. Okay.

throw310822 137 days ago

Slowing down with respect to what?

falloutx 137 days ago

Slowing down with respect to the original speed of response. Basically what we used to get a few months back and what is the best possible experience.

throw310822 137 days ago

There is no "original speed of response". The more resources you pour in, the faster it goes.

falloutx 137 days ago

Watch them decrease resources for the normal mode so people are penny pinched into using fast mode.

throw310822 137 days ago

Seriously, thinking about the price structure of this (6x the price for 2.5x the speed, if that's correct), it seems to target something like real-time applications with very small context. Maybe voice assistants? I guess that if you're doing development it makes more sense to parallelize over more agents rather than paying that much for a modest increase in speed.
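
The arithmetic behind that comparison, as a rough Python sketch. It takes the 6x price and 2.5x speed figures from this thread and assumes the work splits perfectly across agents with no extra tokens spent on coordination, which is optimistic. The token count, base price, and base speed are placeholder values.

```python
# Fast mode vs. several normal-speed agents for the same total token budget.
# Assumes perfectly divisible work; base_price and base_tps are placeholders.

def fast_mode(tokens, base_price, base_tps, price_mult=6.0, speed_mult=2.5):
    """Cost and wall-clock seconds for one agent in fast mode."""
    cost = tokens * base_price * price_mult
    seconds = tokens / (base_tps * speed_mult)
    return cost, seconds

def parallel_agents(tokens, base_price, base_tps, agents):
    """Cost and wall-clock seconds for N normal-speed agents sharing the work."""
    cost = tokens * base_price
    seconds = (tokens / agents) / base_tps
    return cost, seconds

fast_cost, fast_seconds = fast_mode(1_000_000, 1e-5, 50.0)
par_cost, par_seconds = parallel_agents(1_000_000, 1e-5, 50.0, agents=3)
print(fast_cost, fast_seconds)
print(par_cost, par_seconds)
```

Under these assumptions three normal-speed agents finish sooner than one fast-mode agent at one sixth of the cost. The comparison flips for work that is inherently sequential or latency-bound, where extra agents do not help and only the faster single stream does.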