by criemen
129 days ago
One other thing I'd assume Anthropic is doing is routing all fast requests to the latest-gen hardware. They almost certainly have a diverse fleet of inference hardware (TPUs, GPUs of different generations), and fast requests will only be served by whatever is fastest, whereas the general inference workload will be more spread out.

I'm happy to be wrong, but I don't think it's batching improvements.