|
|
|
|
|
by fulafel
564 days ago
|
|
Reading this my first reaction was that my questions still holds. Unless the slowness of the LLMs is of such a magnitude that having a thread or process waiting on the API call would substantially increase its cost, which I guess would mean the LLM server endpoint would be doing very heavy queuing and/or multitasking instead utilizing a powerful compute element for 90% of the call duration. Or maybe the disconnect is where I'm taking for granted that "cost of parked thread" is the same as worrying about the nr of parked threads? Maybe everyone uses Django setups where it's nontrivial to add memory, increase serverless platform limits, etc if you get the happy problem of 10k concurrent users? Or maybe people don't know that the nr of threads/processes you can have on Linux is much more than hundreds or thousands. Or maybe there's some Python or Django specific limits to this. |
|