by KatrKat
1162 days ago
I think some domain-specific considerations include:
1. You need a really big in-memory data set, and each request touches nearly all of it several times, so you really want to e.g. memory-map it and make sure it actually fits in memory on the machine.
2. If using a GPU, you have to make sure the GPU is hooked up to the serving process. You probably want your processes to be heavier-weight than they otherwise would be.
3. You might want to batch requests from several users so they are processed in the same stream of commands to the GPU. So you need to collect the right number of requests before processing any of them, without making any request wait too long. You might also need to sort requests by which inference parameters they override and send them to different servers, because only requests with matching parameters may be batchable together.
4. You might want to stream the output more or less character by character, possibly to several users from one live run on a GPU, after having batched up enough requests to justify a run.
5. Content moderation when you are sending data to the user before you have even seen all of it yourself is an unsolved problem.
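To make the batching point concrete, here is a minimal sketch of collecting requests until either a batch is full or the oldest request has waited too long, with requests grouped by the inference parameters they override. Everything in it (`Request`, `MicroBatcher`, `max_batch`, `max_wait`) is a hypothetical name chosen for illustration, not any real serving framework's API, and it leaves out the GPU call and the streaming side entirely.

```python
import time
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Request:
    prompt: str
    # Hypothetical: overridden inference parameters as a hashable tuple,
    # e.g. (("temperature", 0.2),). Requests only batch with equal params.
    params: tuple = ()
    arrived: float = field(default_factory=time.monotonic)


class MicroBatcher:
    """Groups pending requests by params and releases them in batches."""

    def __init__(self, max_batch=4, max_wait=0.05):
        self.max_batch = max_batch  # assumed batch size limit
        self.max_wait = max_wait    # assumed max seconds a request may wait
        self.pending = defaultdict(list)

    def add(self, req):
        self.pending[req.params].append(req)

    def ready_batches(self, now=None):
        """Return batches that are full or whose oldest request timed out."""
        now = time.monotonic() if now is None else now
        batches = []
        for params in list(self.pending):
            queue = self.pending[params]
            # Release as many full batches as are available.
            while len(queue) >= self.max_batch:
                batches.append(queue[:self.max_batch])
                del queue[:self.max_batch]
            # Release a partial batch once its oldest request has waited long enough.
            if queue and now - queue[0].arrived >= self.max_wait:
                batches.append(queue[:])
                queue.clear()
            if not queue:
                del self.pending[params]
        return batches
```

A serving loop would call `add()` as requests arrive and poll `ready_batches()` on a short timer, handing each returned batch to whatever process owns the GPU. The trade-off is in the two knobs: a larger `max_batch` tends to improve throughput, while a smaller `max_wait` bounds the latency added for requests that arrive when traffic is light.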
|
1. There are many GPU-based applications already in production. I've seen them use work queues, which show up in any system where the load exceeds the capacity, GPU or not.
2. Content moderation is not unique to LLMs.
3. Training and serving users at inference time are different beasts.