|
|
|
|
|
by zozbot234
42 days ago
|
|
But that's demand for cloud inference that's priced on a flat-rate basis with some adjustments (like "off-peak hours"). Not a local rig where inference is effectively free aside from the cost of power whenever the system isn't congested. |
|
(This is a generous argument: it also ignores the massive software stack optimization the cloud companies do that doesn't trickle down to local-rig-sized deployments; for example, prefill/decode disaggregation, which would double the VRAM requirements for a local rig — if you could even do it on a local rig, which you can't, because local rigs don't have Infiniband. But at scale, prefill/decode disaggregation improves capital efficiency, since you can tune the compute-bound prefill node differently than the memory-bound decode node.)
The advantage of local rigs is not capital-efficient tokens. It's privacy. But then again, you can get zero-data-retention options from many inference companies, so for many use cases it may not matter unless you need strict guarantees the data never leaves the building...