

by jiayq84 788 days ago

I run a startup called Lepton AI. We provide an AI PaaS and fast AI runtimes as a service, so we keep a close eye on the IaaS supply chain. Over the last few months we have seen supply get better and better, so the business model that worked 6 months ago - "we have GPUs, come buy barebone servers" - no longer works. However, a bigger problem is emerging, probably one that could shake the industry: people don't know how to use these machines efficiently.

There are clusters of GPUs sitting idle because companies don't know how to use them. Reselling them is embarrassing too, because it hurts the company's image with VCs, but a secondary market is slowly emerging.

Essentially, people want a PaaS or SaaS on top of the barebone machines.

For example, for the last couple of months we have been helping a customer fully utilize their cluster of hundreds of cards. Their IaaS provider was new to the field, so we literally helped both sides (1) understand InfiniBand, NCCL, the training code and so on; (2) figure out control plane traffic; (3) build an accelerated storage layer for training; (4) watch for all kinds of subtle signals that need attention. Did you know that a GPU can appear OK in nvidia-smi but still hit issues when you actually run a CUDA or NCCL kernel? That needs care. (5) set up fast software runtimes, such as an LLM runtime, finetuning scripts, and many others.
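
To make point (4) concrete, here is a minimal sketch of that kind of check, not anyone's production tooling. It assumes PyTorch is installed for the compute probe, and the function names (`list_gpus_via_smi`, `matmul_probe`, `classify`) are made up for illustration. It only exercises a CUDA matmul on each card; catching NCCL problems would need a separate multi-GPU all-reduce test.

```python
import shutil
import subprocess
from typing import Callable, Dict, List


def list_gpus_via_smi() -> List[int]:
    """Return the GPU indices nvidia-smi reports (empty if the tool is missing or fails)."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=30, check=True,
        ).stdout
    except (subprocess.SubprocessError, OSError):
        return []
    return [int(line) for line in out.splitlines() if line.strip().isdigit()]


def matmul_probe(index: int) -> bool:
    """Run a small matmul on one GPU and compare it with the CPU result.

    Assumes PyTorch; any exception (missing library, CUDA error) counts as a failure.
    """
    try:
        import torch
        dev = torch.device(f"cuda:{index}")
        a = torch.randn(512, 512)
        b = torch.randn(512, 512)
        got = (a.to(dev) @ b.to(dev)).cpu()
        return bool(torch.allclose(got, a @ b, rtol=1e-2, atol=1e-2))
    except Exception:
        return False


def classify(visible: List[int], probe: Callable[[int], bool]) -> Dict[int, str]:
    """Label each GPU that is visible to the driver by whether it can actually compute."""
    return {i: ("ok" if probe(i) else "visible-but-failing") for i in visible}


if __name__ == "__main__":
    for gpu, status in classify(list_gpus_via_smi(), matmul_probe).items():
        print(f"GPU {gpu}: {status}")
```

The probe is passed in as a function so the same loop could run a heavier check (a longer burn-in, a collective across cards) without changing the reporting.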

So I think AI PaaS and SaaS are going to be a very valuable (and big) market once people come out of the frenzy of "grabbing GPUs" and need to use them efficiently.