by doctorpangloss 1033 days ago
- SSH access isn’t super useful. If I have to author a bootstrapping script for my system it’s too much friction.

- the people who thrive at this use orchestration, like Slurm or Kubernetes. So the nodes I buy should join automatically to my orchestration control plane.

- people who don’t use orchestration or don’t own their orchestration will not run big jobs or be repeat customers. It doesn’t make sense to use nonstandard orchestration. I understand that it is something that people do, but it’s dumb.

- so basically I would pay for a ClusterAutoscaler across clouds. I would even pay a 5% fee for it automatically choosing the cheapest of the fungible nodes. I am basically describing Karpenter for multiple clouds. Then at least the whole offering makes sense from a sophisticated person’s POV: your Karpenter clone can see e.g. a Ray CRD and size the nodes, giving me a firm hourly rate or even an upfront price to approve.

- I wouldn’t pay that fee to use your control plane, I don’t want to use a startup’s control plane or scheduler.

- I’m not sure why the emphasis on GPU availability or blah blah blah. Either AWS/GCE/AKS grants you quota or it doesn’t. Your thing ought to delegate and automate the quota requests; maybe you even have an account manager at every major cloud to bundle it all.

- as you have probably noticed, the off-brand clouds play lots of games with their supposed inventory. They don’t have any expertise running applications or doing networking; they are ex-crypto miners. I understand that they offer an attractive headline price, but for an LLM training job, they “vast”ly overpromise their “core” offering.

- if you really want to save people money on GPUs, buy a bunch of servers and rack them and sell a lower hourly rate.
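
A minimal sketch of the "Karpenter for multiple clouds" idea from the list above: pick the cheapest of the fungible node offers, add the 5% fee, and hand back a firm hourly rate plus an upfront price to approve. The `Offer` type, the cloud labels, and every price here are made up for illustration, and `gpus_needed` stands in for whatever would actually be read off a workload spec such as a Ray CRD.

```python
from dataclasses import dataclass

FEE_RATE = 0.05  # the 5% fee floated in the comment above


@dataclass(frozen=True)
class Offer:
    cloud: str          # hypothetical provider label
    gpu_type: str
    gpus_per_node: int
    hourly_usd: float   # made-up price for one node, per hour


def nodes_required(offer, gpus_needed):
    """Number of nodes of this offer needed to cover the GPU request."""
    return -(-gpus_needed // offer.gpus_per_node)  # ceiling division


def cheapest_quote(offers, gpu_type, gpus_needed, hours):
    """Pick the cheapest fungible offer and return a firm quote for it."""
    candidates = [o for o in offers if o.gpu_type == gpu_type]
    if not candidates:
        raise ValueError(f"no offers for {gpu_type}")

    def cluster_hourly(offer):
        return nodes_required(offer, gpus_needed) * offer.hourly_usd

    best = min(candidates, key=cluster_hourly)
    hourly = cluster_hourly(best) * (1 + FEE_RATE)
    return {
        "cloud": best.cloud,
        "nodes": nodes_required(best, gpus_needed),
        "hourly_usd": round(hourly, 2),
        "upfront_usd": round(hourly * hours, 2),
    }


offers = [
    Offer("cloud-a", "a100", 8, 32.00),
    Offer("cloud-b", "a100", 8, 29.50),
    Offer("cloud-c", "a100", 4, 15.50),
    Offer("cloud-b", "h100", 8, 60.00),
]
quote = cheapest_quote(offers, "a100", gpus_needed=16, hours=10)
print(quote)
```

Comparing whole-cluster cost rather than per-node price matters here: the 4-GPU offer has the lowest node price but needs twice as many nodes.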


Thank you for the feedback. We're still early in this and are planning on moving in some of the directions you mentioned.

- We agree that moving towards 'Karpenter for multiple clouds' would be more valuable for some use cases and hope to support that feature soon.

- We do help customers with one-off quota requests, and it is a feature we want to bake into our platform on top of aggregating demand in our accounts. Many companies with AWS/GCE/AKS quota still cannot reliably get on-demand instances due to capacity shortages.

Yeah I mean I’m sure you look at Karpenter and think “well it does everything for free, and the code to choose the cheapest node would be straightforward.” Kubernetes already has sophisticated scheduling algorithms that could consider price as a constraint.
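
To make "price as a constraint" concrete, here is a rough sketch in plain Python. The Kubernetes scheduler works in a filter phase followed by a score phase; this only mimics that shape and is not the scheduler plugin API. The node dictionaries, the weights, and the two scoring terms are assumptions for illustration.

```python
def pick_node(nodes, gpus_needed, price_weight=0.8, packing_weight=0.2):
    """Filter out nodes that cannot fit the job, then score the rest."""
    # Filter phase: hard constraint on free GPUs.
    feasible = [n for n in nodes if n["free_gpus"] >= gpus_needed]
    if not feasible:
        return None

    max_price = max(n["hourly_usd"] for n in feasible)

    def score(node):
        # Cheaper nodes score higher; the priciest feasible node gets 0.
        cheapness = 1.0 - node["hourly_usd"] / max_price
        # Nodes the job fills more completely score higher (bin packing).
        tightness = gpus_needed / node["free_gpus"]
        return price_weight * cheapness + packing_weight * tightness

    # Score phase: highest combined score wins.
    return max(feasible, key=score)["name"]


# Hypothetical inventory with made-up prices.
nodes = [
    {"name": "node-a", "free_gpus": 8, "hourly_usd": 32.0},
    {"name": "node-b", "free_gpus": 8, "hourly_usd": 24.0},
    {"name": "node-c", "free_gpus": 2, "hourly_usd": 6.0},
]
choice = pick_node(nodes, gpus_needed=4)
print(choice)
```

Here price is a weighted preference rather than a hard limit; a strict budget cap would belong in the filter step instead.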

I can’t say what people will actually pay for, because CTOs and engineers are penny pinchers: they will go through a lot of pain to pay $0. They are the worst customers. IMO most allegedly B2B Y Combinator offerings are really B2C in disguise, selling productivity apps and pretty interfaces to 22 year olds with busy schedules of Bumble swiping who happen to work as developers and PMs at big enterprises. The senior people I know with the real budgets look at a thing and think “I’d program this with my headcount to save a 5% fee.” This is coming from someone who does charge a royalty, but only because it is customary in my business to do so.

People who spend money love their pricing “formatted” a certain way. CTOs love it to be formatted as “free” with a bunch of trickle priced exorbitant usage gotchas (Snowflake). They don’t love prices formatted as royalties. Time will tell of course.

Anyway, most use cases don’t even make sense; they are deep in the negative for ROI. Most enterprises cannot do software R&D like LLM training or even serving. The biggest success story in town uses Kubernetes. I’m not sure there’s space for 10 more control planes to run on top of your control planes; they add a lot of complexity for little gain.

A bunch of Kubernetes manifests to fine-tune LLaMA 2 on a dataset hosted in blob storage on DGX machines is a commodity. People think it’s sensitive, and there’s a bonanza for people who can author that YAML, but it’s inevitable that someone will release a proper multi-node training job with vanilla resources. Yet here we are, with a dozen “free” trickle-priced weird CRD control plane-esque products obscuring this.
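
For a sense of how little is involved, here is a rough sketch that builds a vanilla `batch/v1` Indexed Job manifest as a Python dict and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The job name, image, dataset URL, and `DATASET_URL` variable are hypothetical placeholders, and a real multi-node run would also need the rendezvous wiring between pods (for example a headless Service), which is left out here.

```python
import json


def training_job(name, image, dataset_url, nodes, gpus_per_node):
    """Build a plain batch/v1 Job with one indexed pod per node."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # One pod per node; Indexed mode gives each pod a stable
            # index via the JOB_COMPLETION_INDEX environment variable.
            "completions": nodes,
            "parallelism": nodes,
            "completionMode": "Indexed",
            "backoffLimit": 0,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "trainer",
                            "image": image,  # hypothetical image
                            "env": [
                                # Hypothetical variable read by the trainer.
                                {"name": "DATASET_URL", "value": dataset_url},
                            ],
                            "resources": {
                                "limits": {"nvidia.com/gpu": gpus_per_node},
                            },
                        }
                    ],
                }
            },
        },
    }


manifest = training_job(
    name="llama2-finetune",
    image="example.com/llama2-finetune:latest",
    dataset_url="https://example.com/dataset",
    nodes=2,
    gpus_per_node=8,
)
print(json.dumps(manifest, indent=2))
```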

Time will tell, but I think you're discounting the number of amateurs who will join the gold rush with a dream, an angel, and just enough knowledge to run a customized alpaca-lora job, looking for GPUs while an outsider controls the runway.

This is an iPhone 3 level event, with new millionaires being created out of the LLM equivalent of the mobile fart button.

People selling tools of any kind will make bank.