by Mernit
1015 days ago
What are you using for K8s autoscaling? We initially tried a few of the standard K8s scaling mechanisms and found that they didn't work well for GPU workloads. For example, if we were serving a low-RAM Hugging Face model on a GPU, it wouldn't trigger autoscaling. But since the GPU could only process one request at a time, requests queued up and the system bottlenecked while it worked through them one by one. We wrote a bit about this here, if anyone is interested: https://www.beam.cloud/blog/serverless-autoscaling
Orthogonal solution but could help.
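
To make the bottleneck concrete, here is a rough sketch that contrasts a resource-utilization signal with a queue-depth signal. It is not the implementation from the linked post. The first function uses the replica formula documented for the Kubernetes Horizontal Pod Autoscaler; the second is a hypothetical queue-based rule that assumes each GPU replica handles one request at a time. All function names, parameters, and numbers are made up for illustration.

```python
import math


def utilization_based_replicas(current_replicas, current_util, target_util):
    """Replica count from the documented HPA rule:
    ceil(currentReplicas * currentMetricValue / desiredMetricValue)."""
    return math.ceil(current_replicas * current_util / target_util)


def queue_based_replicas(queued, in_flight, per_replica=1,
                         min_replicas=0, max_replicas=10):
    """Hypothetical rule: one replica per `per_replica` outstanding requests.

    Assumes a GPU replica serves `per_replica` requests concurrently
    (1 by default, matching the one-request-at-a-time case).
    """
    if per_replica < 1:
        raise ValueError("per_replica must be at least 1")
    outstanding = queued + in_flight
    wanted = math.ceil(outstanding / per_replica)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, wanted))


# Illustrative numbers only: one replica at 30% memory against an 80% target,
# with 1 request running and 7 waiting behind it.
by_memory = utilization_based_replicas(1, 0.30, 0.80)
by_queue = queue_based_replicas(queued=7, in_flight=1)
print(by_memory, by_queue)
```

With these numbers the memory signal asks for a single replica while the queue signal asks for eight, which is the gap described above: the model is small, so memory never crosses the target, even though requests are piling up.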