| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by antirez 465 days ago
	Download the model in background. Serve the client with an LLM vendor API just for the first requests, or even using that same local LLM installed on your own servers (likely cheaper). By doing so, in the long run the inference cost is near-zero and allows to use LLMs in otherwise impossible business models (like freemium).

2 comments

manmal 465 days ago

Personally, I only use locally run models when I absolutely can’t have the prompt/context uploaded to a cloud. For anything else, I just use one of the commercial cloud hosted models. The ones I‘m using are way faster and better in _every_ way except privacy. Eg if you are ok to spend more, you can get blazing fast DeepSeek v3 or R1 via OpenRouter. Or, rather cheap Claude Sonnet via Copilot (pre-release also has Gemini 2.5 Pro btw).

I’ve gotten carried away - I meant to express that using cloud as a fallback for local models is something I absolutely don’t want or need, because privacy is the whole and only point to local models.

link

aazo11 465 days ago

Exactly. Why does this not exist yet?

link

byyoung3 465 days ago

its an if statement on whether the model has downloaded or not

link

aazo11 465 days ago

A better solution would train/finetune the smaller model from the responses of the larger model and only push to the inference to the edge if the smaller model is performant and the hardware specs can handle the workload?

link

monoid73 465 days ago

yeah, that'd b nice, some kind of self-bootstrapping system where you start with a strong cloud model, then fine-tune a smaller local one over time until it’s good enough to take over. tricky part is managing quality drift and deciding when it's 'good enough' without tanking UX. edge hardware's catching up though, so feels more feasible by the day.

link