by nibab 930 days ago

Even though they offer inference, training is their primary focus.

Inference hasn't really picked up revenue-wise (across the space) compared to training, and it's not a great market to be in. As you mentioned, it's crowded and the barriers to entry are minimal. Anyone with experience in spinning up containers and scaling them can offer this service. Paradoxically, it's also the market where the big cloud providers are very well positioned to dominate. Spiky and unpredictable workloads are their bread and butter. Their whole economic and infrastructural model is pretty much tailored to this traffic pattern.

Training is a totally different ball game. It is a model that is disruptive to big cloud providers given that it follows very different traffic patterns. Training LLMs involves spinning up 100s to 1000s of machines for a relatively short period of time, with interconnect that doesn't typically exist in data centers. That is a very unique workload. Additionally, you need significantly more specialized ML knowledge in tensor parallelism, optimizations, CUDA, etc. That is not as common as scaling a container-based workload.

Fun fact: Oracle is surprisingly well positioned in terms of their interconnect fabric. Even Microsoft is partnering with CoreWeave for GPU clusters because they don't have as much capacity interconnected in the right way.

2 comments

omeze 930 days ago

This makes sense, but I find it hard to believe the big cloud players won’t have the datacenter skills to compete…

I agree that supply is an issue, but paradoxically the fact that these GPU cloud providers (CoreWeave et al) are partnering with the big cloud players says that the big cloud players are where people would prefer to buy. Once supply constraints are solved, these providers would need some novel offering beyond “we have hardware”, e.g. some specialized distributed training framework. But MS/Google/AWS are also building their frameworks so…

And then the elephant in the room is: why is compute spend so imbalanced on training vs inference? Is it that there aren't enough real use cases? Is it that improvements are so frequent it makes sense to toss out older versions? Is it that privately trained models are a requirement for the highest spenders? My impression is that a lot of corporate spend at scaleups is purely speculative R&D to evaluate capabilities, but that's a small sample from friends.

boiler_up800 930 days ago

The interconnect comment here is spot on.