HN Mirror

	by ntonozzi 99 days ago
	Why do they need to run benchmarks to confirm performance? Can't they run an example prompt and verify they get the exact same output token probabilities for all prompts? The fact that they are not doing this makes me suspicious that they are in fact not doing the exact same thing as vLLM. It is also a bit weird that they are not incorporating speculative decoding, which seems like a critical performance optimization, especially for decode-heavy workloads.

3 comments

lukebechtel 99 days ago

Yes, speculative decoding will make both us and vLLM faster, but we believe it would be a roughly even bump on both sides, so we didn't include it in this comparison. Worth another test!

nyrikki 98 days ago

> Can't they run an example prompt and verify they get the exact same output token probabilities for all prompts?

You don’t even get that with GPUs in general, or really floating point in general.

The Art of Computer Programming, Volume 2: Seminumerical Algorithms, section 4.2.2, explains where floating-point addition loses the associativity property.

Apartness relations are another possible lens.
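
To make the associativity point concrete, here is a minimal Python sketch using ordinary IEEE 754 doubles on a CPU. It is only an illustration of the general floating-point effect, not a reproduction of any GPU kernel; the `sum_in_order` helper and the sample values are hypothetical and chosen just to expose the rounding.

```python
import math

# Floating-point addition is not associative: regrouping changes the rounding.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

exactly_equal = left == right            # False: the last bit differs
close_enough = math.isclose(left, right)  # True: equal within a tolerance


def sum_in_order(values):
    """Accumulate left to right, the way a simple sequential reduction would."""
    total = 0.0
    for v in values:
        total += v
    return total


# Hypothetical values: the same three numbers summed in two different orders.
values = [1.0, 1e16, -1e16]
forward = sum_in_order(values)             # the 1.0 is absorbed by 1e16
backward = sum_in_order(reversed(values))  # the large terms cancel first

print(exactly_equal, close_enough, forward, backward)
```

A parallel reduction that changes its accumulation order (for example with batch size or hardware) can shift results in the same way, which is why comparing token probabilities with a tolerance is more realistic than demanding bit-exact equality.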

ntonozzi 96 days ago

Yeah you can: https://thinkingmachines.ai/blog/defeating-nondeterminism-in....

nyrikki 95 days ago

> However, as the name “batch-invariant” suggests, the technique is currently limited to handling variations related only to the batch dimension, making it robust to continuous batching and other batch-size–related changes, but not to other forms of nondeterminism like changing the TP sizes or GPU types.

https://arxiv.org/abs/2506.09501

jeeeb 98 days ago

> It is also a bit weird that they are not incorporating speculative decoding

Wouldn’t speculative decoding decrease overall throughput, but optimise (perceived) responsiveness?

YetAnotherNick 98 days ago

In the compute-bound regime (high batch size), yes, but at low batch sizes it could improve throughput.
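
A rough way to see this trade-off is a toy cost model, sketched below in Python. Everything in it is an assumption made for illustration: the function names, the acceptance rate, the draft length, and the relative costs are hypothetical, and each drafted token is assumed to be accepted independently with the same probability. It is not a measurement of vLLM or any other engine.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification step.

    Assumes each drafted token is accepted independently with probability
    accept_rate, and that the target model always contributes one token.
    """
    if accept_rate >= 1.0:
        return draft_len + 1.0
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def relative_throughput(accept_rate: float, draft_len: int,
                        draft_cost: float, verify_cost: float) -> float:
    """Throughput relative to plain decoding (one token per unit of cost).

    draft_cost:  cost of one draft-model step, in units of one target step.
    verify_cost: cost of verifying draft_len + 1 tokens, in the same units.
    """
    tokens = expected_tokens_per_step(accept_rate, draft_len)
    cost = draft_len * draft_cost + verify_cost
    return tokens / cost


# Hypothetical settings, chosen only to show the two regimes.
ACCEPT_RATE, DRAFT_LEN, DRAFT_COST = 0.8, 4, 0.05

# Low batch size (memory-bound): verifying several tokens costs about one step.
low_batch = relative_throughput(ACCEPT_RATE, DRAFT_LEN, DRAFT_COST, verify_cost=1.0)

# High batch size (compute-bound): verification cost grows with token count.
high_batch = relative_throughput(ACCEPT_RATE, DRAFT_LEN, DRAFT_COST,
                                 verify_cost=DRAFT_LEN + 1.0)

print(f"low batch: {low_batch:.2f}x, high batch: {high_batch:.2f}x")
```

Under these assumed numbers the low-batch case comes out above 1x and the high-batch case below 1x, which matches the intuition in the comment: rejected draft tokens are nearly free when the GPU is memory-bound, and wasted work when it is compute-bound.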
