by adiraja
545 days ago
Yes, per-user throughput might be lower at the moment. We're working on GPU kernel-level optimizations now to fix that. But across all users on our system, throughput is better, because running more prefills or a large batch of grouped decodes makes better use of the GPU. The idea is that this works for someone who wants to build a product whose initial response time is consistent across users, and who can trade off some end-to-end (E2E) latency for that. It ensures that no one waits a long time for the first response.
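
A toy cost model makes the trade-off concrete. This is only a sketch and not the actual scheduler: it assumes each decode step pays a fixed overhead plus a small cost per sequence in the batch, and the constants `STEP_OVERHEAD_MS` and `PER_SEQ_MS` are made-up values chosen for illustration, not measurements.

```python
# Toy model of batched decoding. All constants are hypothetical,
# picked only to illustrate the trade-off, not measured on any system.
STEP_OVERHEAD_MS = 20.0  # assumed fixed cost per decode step
PER_SEQ_MS = 0.5         # assumed extra cost per sequence in the batch


def decode_step_ms(batch_size: int) -> float:
    """Time for one decode step that emits one token per sequence."""
    return STEP_OVERHEAD_MS + PER_SEQ_MS * batch_size


def per_user_tokens_per_s(batch_size: int) -> float:
    """Each user gets one token per step, so this falls as the batch grows."""
    return 1000.0 / decode_step_ms(batch_size)


def aggregate_tokens_per_s(batch_size: int) -> float:
    """Tokens per second summed over every user in the batch."""
    return batch_size * per_user_tokens_per_s(batch_size)


if __name__ == "__main__":
    for b in (1, 8, 32):
        print(
            f"batch={b:>2}  "
            f"per-user={per_user_tokens_per_s(b):6.1f} tok/s  "
            f"aggregate={aggregate_tokens_per_s(b):7.1f} tok/s"
        )
```

Under these assumptions, growing the batch lowers each user's token rate while raising the total across users, which is the shape of the trade-off described above.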