|
|
|
|
|
by IMTDb
136 days ago
|
|
I’m curious what’s behind the speed improvements. It seems unlikely it’s just prioritization, so what else is changing? Is it new hardware (à la Groq or Cerebras)? That seems plausible, especially since it isn’t available on some cloud providers. Also wondering whether we’ll soon see separate “speed” vs “cleverness” pricing on other LLM providers too. |
|
Mathematically it comes from the fact that this transformer block is this parallel algorithm. If you batch harder, increase parallelism, you can get higher tokens/s. But you get less throughput. Simultaneously there is also this dial that you can speculatively decode harder with fewer users.
Its true for basically all hardware and most models. You can draw this Pareto curve of how much throughput per GPU vs how many tokens per second per stream. More tokens/s less total throughput.
See this graph for actual numbers:
Token Throughput per GPU vs. Interactivity gpt-oss 120B • FP4 • 1K / 8K • Source: SemiAnalysis InferenceMAX™
https://inferencemax.semianalysis.com/