|
|
|
|
|
by roughly
566 days ago
|
|
> Can you expand more on why these AI cases make the complexity tradeoff different? They’re very slow. Like, several seconds to get a response slow. If you’re serving a very large number of very fast requests, you can argue that the simplicity of the sync model makes it worth it to just scale up the number of processes required to serve that many requests, but the LLM calls are slow enough that it means you need to dramatically scale up the number of serving processes available if you’re going to keep the sync model, and that’s mostly to have CPUs sitting around idle waiting for the LLM to come back. The async model can also let you parallelize calls to the LLMs if you’re making multiple independent calls within the same request - this can cut multiple seconds off your response time. |
|