HN Mirror


	by entrope 44 days ago
	What's the basis for saying local tokens will always be cheaper? As others have outlined, LLMs serving one user at a time are pretty expensive, but concurrent users become much more cost-effective (assuming there's enough RAM for the contexts). If "local" to you means ~10 hours daily use by a team of employees, the company still has to balance against cloud services that can amortize non-recurring costs over 24 hours of service per day.
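
To make the amortization point concrete, here is a minimal sketch of hardware cost per million tokens as a function of how many hours per day the machine is busy. The function name and every input (purchase price, lifetime, throughput) are illustrative assumptions, not figures from this thread, and power and staffing are ignored.

```python
def capex_per_million_tokens(capex_usd, lifetime_years, hours_per_day, tokens_per_second):
    """Hardware cost amortized per million tokens (ignores power, staff, cooling)."""
    busy_seconds = lifetime_years * 365 * hours_per_day * 3600
    total_tokens = busy_seconds * tokens_per_second
    return capex_usd / total_tokens * 1_000_000


# All inputs below are placeholder assumptions, not measured figures.
office = capex_per_million_tokens(100_000, 3, hours_per_day=10, tokens_per_second=1_000)
always_on = capex_per_million_tokens(100_000, 3, hours_per_day=24, tokens_per_second=1_000)

print(f"10 h/day: ${office:.2f} per million tokens")
print(f"24 h/day: ${always_on:.2f} per million tokens")
print(f"ratio: {office / always_on:.1f}x")
```

With identical hardware and throughput, the ratio depends only on utilization: 24 / 10 = 2.4, whatever the other inputs are.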

zozbot234 43 days ago

Why would a team of employees not be able to run AI workloads 24/7? Not all workloads are time sensitive.

entrope 43 days ago

Both my experience and Anthropic's off-peak promotion indicate that demand is very uneven between peak and off-peak hours. How close do you think they are?

zozbot234 43 days ago

But that's demand for cloud inference that's priced on a flat-rate basis with some adjustments (like "off-peak hours"). Not a local rig where inference is effectively free aside from the cost of power whenever the system isn't congested.
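
As a rough sketch of the "free aside from the cost of power" case, the marginal electricity cost per million tokens can be estimated from power draw, electricity price, and throughput. The function name and the example inputs are assumptions chosen for illustration, not measurements of any particular rig.

```python
def power_cost_per_million_tokens(watts, usd_per_kwh, tokens_per_second):
    """Electricity cost to generate one million tokens at a steady rate."""
    seconds = 1_000_000 / tokens_per_second
    kwh = (watts / 1000) * (seconds / 3600)
    return kwh * usd_per_kwh


# Placeholder assumptions: a 1 kW rig, $0.20 per kWh, 50 tokens per second.
cost = power_cost_per_million_tokens(watts=1_000, usd_per_kwh=0.20, tokens_per_second=50)
print(f"${cost:.2f} per million tokens in electricity")
```

The result scales inversely with throughput, so a slow rig running a large model pays proportionally more in power for each token even when the hardware itself is treated as sunk cost.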

reissbaker 43 days ago

The local rig is not free and requires very large capital expenditures while producing very low token throughput for large models. Within any time budget, you can get many orders of magnitude more large-model tokens off an 8xB200 than off a local rig. Therefore cloud tokens have a huge capital efficiency advantage over local rigs. That will continue basically forever, since there will always be large cloud companies willing to spend millions of dollars for more capital-efficient hardware, so Nvidia and friends will continue to spare no expense producing it, meaning the cloud hardware will be way too expensive if you're not a large inference company. You can also buy local rigs, but they will be less capital efficient per token, not more.

(This is a generous argument: it also ignores the massive software stack optimization the cloud companies do that doesn't trickle down to local-rig-sized deployments. One example is prefill/decode disaggregation, which would double the VRAM requirements for a local rig, if you could even do it on a local rig, which you can't, because local rigs don't have InfiniBand. But at scale, prefill/decode disaggregation improves capital efficiency, since you can tune the compute-bound prefill node differently than the memory-bound decode node.)

The advantage of local rigs is not capital-efficient tokens. It's privacy. But then again, you can get zero-data-retention options from many inference companies, so for many use cases it may not matter unless you need strict guarantees the data never leaves the building...
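
The capital-efficiency comparison above can be sketched as tokens produced per dollar of hardware within a fixed time budget. The `Rig` class, both configurations, and all prices and throughputs here are hypothetical placeholders, not benchmarks of an 8xB200 or of any local rig.

```python
from dataclasses import dataclass


@dataclass
class Rig:
    name: str
    capex_usd: float
    tokens_per_second: float  # aggregate throughput across all concurrent requests

    def tokens_per_dollar(self, budget_hours: float) -> float:
        """Tokens generated in the time budget, per dollar of purchase price."""
        return self.tokens_per_second * budget_hours * 3600 / self.capex_usd


# Placeholder numbers, chosen only to show the shape of the comparison.
rigs = [
    Rig("hypothetical local rig", capex_usd=50_000, tokens_per_second=500),
    Rig("hypothetical datacenter node", capex_usd=500_000, tokens_per_second=25_000),
]

for rig in rigs:
    print(f"{rig.name}: {rig.tokens_per_dollar(budget_hours=1):.0f} tokens per dollar per hour")
```

The more expensive node comes out ahead on this metric only when its throughput advantage exceeds its price ratio, so the claim depends on aggregate throughput under batching, not on raw speed alone.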

zozbot234 43 days ago

> The local rig is not free and requires very large capital expenditures while producing very low token throughput for large models.

Sometimes it really is free though, because the hardware was bought to serve some other existing needs and that capital expense was fully depreciated quite some time ago. Underutilised hardware is essentially ubiquitous.

> Within any time budget, you can get many orders of magnitude more large-model tokens off an 8xB200 than off a local rig.

But using that 8xB200 setup to run inference on cheap, non-frontier models is a plain waste. Its highest and best use is in an AI datacenter serving exceptionally smart models like Gemini DeepThink, GPT Pro or Claude Mythos. (If this isn't true, it means that the current level of large-scale investment in frontier, superintelligent AI is misplaced, and you should worry about that; not whether some models are best run on lower-end hardware!)

reissbaker 40 days ago

> Sometimes it really is free though, because the hardware was bought to serve some other existing needs and that capital expense was fully depreciated quite some time ago.

No one has 8xRTX Pro 6000s that have depreciated to zero "quite some time ago."

> But using that 8xB200 setup to run inference on cheap, non-frontier models is a plain waste.

From whose perspective? If someone wants to run an open-source model — and plenty do — someone buying or renting an 8xB200 to serve it cheaply at scale is much better than everyone buying huge amounts of pointless, wasted hardware such as 8xRTX Pro 6000s for $80,000 per person.
