| HN Mirror



	by criemen 129 days ago
	One other thing I'd assume Anthropic is doing is routing all fast requests to the latest-gen hardware. They almost certainly have a diverse fleet of inference hardware (TPUs, GPUs of different generations), and fast requests will only be served by whatever is fastest, whereas the general inference workload will be spread across more of the fleet.

1 comment

martinald 129 days ago

This was my assumption too. GB200 memory bandwidth is about 2.4x that of H100, so personally I think that's all it is. It doesn't really make sense otherwise: yes, there are tricks to get a faster time to first token, but not really to raise throughput for the same model (speculative decoding etc., but they already use that).

I'm happy to be wrong but I don't think it's batching improvements.
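
For anyone who wants to sanity-check the 2.4x figure, here is a rough back-of-the-envelope sketch. It assumes decode is purely memory-bandwidth bound, so each generated token costs one full read of the weights and tokens per second scale linearly with bandwidth. The per-GPU bandwidth numbers (about 3.35 TB/s for H100 SXM, about 8 TB/s for each Blackwell GPU in a GB200) are the commonly quoted spec-sheet values, and the 100 GB weight footprint is a made-up placeholder, not anything known about Anthropic's models.

```python
def decode_tokens_per_second(bandwidth_bytes_per_s: float,
                             bytes_read_per_token: float) -> float:
    """Upper bound on single-stream decode speed when memory-bandwidth bound."""
    return bandwidth_bytes_per_s / bytes_read_per_token


# Assumed spec-sheet HBM bandwidth per GPU, in bytes per second.
H100_BANDWIDTH = 3.35e12
B200_BANDWIDTH = 8.0e12

# Hypothetical weight footprint read once per generated token (placeholder).
WEIGHT_BYTES = 100e9

h100_rate = decode_tokens_per_second(H100_BANDWIDTH, WEIGHT_BYTES)
b200_rate = decode_tokens_per_second(B200_BANDWIDTH, WEIGHT_BYTES)
speedup = b200_rate / h100_rate

print(f"H100: {h100_rate:.1f} tok/s, B200: {b200_rate:.1f} tok/s, "
      f"speedup: {speedup:.2f}x")
```

The ratio doesn't depend on the placeholder model size, only on the two bandwidth numbers, which is why the hardware swap alone could plausibly explain a speedup of that size.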
