| - SSH access isn’t super useful. If I have to author a bootstrapping script for my system, it’s too much friction. - The people who thrive at this use orchestration, like Slurm or Kubernetes, so the nodes I buy should automatically join my orchestration control plane. - People who don’t use orchestration or don’t own their orchestration will not run big jobs or be repeat customers. It doesn’t make sense to use nonstandard orchestration. I understand that it is something that people do, but it’s dumb. - So basically I would pay for a ClusterAutoscaler across clouds. I would even pay a 5% fee for it automatically choosing the cheapest of the fungible nodes. I am basically describing Karpenter for multiple clouds. Then at least the whole offering makes sense from a sophisticated person’s POV: your Karpenter clone can see, e.g., a Ray CRD and size the nodes, giving me a firm hourly rate or even an upfront price to approve. - I wouldn’t pay that fee to use your control plane; I don’t want to use a startup’s control plane or scheduler. - I’m not sure why the emphasis on GPU availability or blah blah blah. Either AWS/GCE/AKS grants you quota or it doesn’t. Your thing ought to delegate and automate the quota requests; maybe you even have an account manager at every major cloud for that, to bundle it all. - As you have probably noticed, the off-brand clouds play lots of games with their supposed inventory. They don’t have any expertise running applications or doing networking; they are ex-crypto miners. I understand that they offer an attractive headline price, but for an LLM training job they “vast”ly overpromise their “core” offering. - If you really want to save people money on GPUs, buy a bunch of servers, rack them, and sell a lower hourly rate. |
- We agree that moving towards 'Karpenter for multiple clouds' would be more valuable for some use cases and hope to support that feature soon.
- We do help customers with one-off quota requests, and we want to build that into the platform, on top of aggregating demand in our accounts. Quota alone is not always enough, though: many companies with AWS/GCE/AKS quota still cannot reliably get on-demand instances because of capacity shortages.
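
To make the "cheapest of the fungible nodes, plus a 5% fee" idea from the quoted comment concrete, here is a minimal Python sketch of the arithmetic. Everything in it is an assumption for illustration: the `Offer` type, the `cheapest_offer` helper, and whatever provider labels and prices are passed in are hypothetical, and the 5% default is simply the figure the commenter suggested. It is not a description of how Karpenter or any existing platform selects nodes.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Offer:
    """One interchangeable ("fungible") node type from some provider (hypothetical)."""
    cloud: str          # provider label, illustrative only
    instance_type: str  # instance name, illustrative only
    gpus: int           # GPUs per node
    hourly_usd: float   # price per node-hour, illustrative only


def cheapest_offer(
    offers: Iterable[Offer], gpus_needed: int, fee_rate: float = 0.05
) -> Tuple[Offer, int, float]:
    """Return (offer, node_count, hourly_total_including_fee) for the cheapest fit."""
    best = None
    for offer in offers:
        # Ceiling division: enough whole nodes to cover the requested GPUs.
        nodes = -(-gpus_needed // offer.gpus)
        total = nodes * offer.hourly_usd * (1 + fee_rate)
        if best is None or total < best[2]:
            best = (offer, nodes, total)
    if best is None:
        raise ValueError("no offers available")
    return best
```

A caller would build the `offers` list from whichever clouds it has quota in, pass the GPU count implied by the workload (for example, from a Ray cluster spec), and show the returned hourly total to the user as the firm rate to approve.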