| I was thinking about inhouse model inference speeds at frontier labs like Anthropic and OpenAI after reading the "Claude built a C compiler" article. Having higher inference speed would be an advantage, especially if you're trying to eat all the software and services. Anthropic offering 2.5x makes me assume they have 5x or 10x themselves. In the predicted nightmare future where everything happens via agents negotiating with agents, the side with the most compute, and the fastest compute, is going to steamroll everyone. |
They said the 2.5X offering is what they've been using internally. Now they're offering via the API: https://x.com/claudeai/status/2020207322124132504
LLM APIs are tuned to handle a lot of parallel requests. In short, the overall token throughput is higher, but the individual requests are processed more slowly.
The scaling curves aren't that extreme, though. I doubt they could tune the knobs to get individual requests coming through at 10X the normal rate.
This likely comes from having some servers tuned for higher individual request throughput, at the expense of overall token throughput. It's possible that it's on some newer generation serving hardware, too.