HN Mirror

Hacker News


	by ethbr1 838 days ago
	How would that work technically, from a cost of goods sold perspective? (honestly asking, curious)

2 comments

vermorel 838 days ago

The "cost" is storing the state of the LLM after processing the input. My back-of-the-envelope guesstimate gives me 1GB to capture the 8-bit state of a 70B-parameter model (I might be wrong though, insights are welcome), which is quite manageable with NVMe storage for fast reload. The operator would charge per "saved" prompt, plus maybe a fixed per-call fee to reload the state.


YetAnotherNick 838 days ago

My calculation of the KV cache gives 1GB per 3000 tokens for fp16. I am surprised OpenAI's competitors haven't done this. This kind of feature has not-so-niche uses, wherever prefix data could be cached.
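
For anyone who wants to check the arithmetic, here is a rough sketch of the usual KV cache size formula. The model shape is an assumption on my part (80 layers, 8 KV heads via grouped-query attention, head dimension 128, which is a Llama-2-70B-like configuration); the comment above does not say which model was used, and `kv_cache_bytes` is just an illustrative helper name.

```python
def kv_cache_bytes(tokens, n_layers, n_kv_heads, head_dim, bytes_per_value):
    """Bytes needed to hold the cached keys and values for `tokens` positions."""
    # Factor 2: each layer stores one key vector and one value vector
    # per KV head for every token in the prefix.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * tokens


# ASSUMPTION: a Llama-2-70B-like shape, not confirmed by the thread.
SHAPE = dict(n_layers=80, n_kv_heads=8, head_dim=128)

fp16_3000 = kv_cache_bytes(3000, bytes_per_value=2, **SHAPE)
int8_3000 = kv_cache_bytes(3000, bytes_per_value=1, **SHAPE)

print(f"fp16, 3000 tokens: {fp16_3000 / 1e9:.2f} GB")
print(f"int8, 3000 tokens: {int8_3000 / 1e9:.2f} GB")
```

Under that assumed shape, fp16 comes out at roughly 1GB per 3000 tokens, and 8-bit storage halves it. A model without grouped-query attention would need several times more.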


FergusArgyll 838 days ago

That's a great idea! It would also open up the possibility for very long 'system prompts' on the side of the company, so they could better fine-tune their guardrails


cjbprime 838 days ago

I think the answer's in the original question: the provider has to pay for extra storage to cache the model state at the prompt you're asking to snapshot. But it's not necessarily a net increase in costs for the provider, because in exchange for doing so they (as well as you) are getting to avoid many expensive inference rounds.
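
To make that trade-off concrete, here is a toy break-even sketch. Every price in it is a made-up placeholder, not a real provider rate, and `breakeven_calls` is a hypothetical helper; the point is only the shape of the comparison between storage cost and the prefill compute saved on each call.

```python
def breakeven_calls(cache_gb, storage_price_per_gb_hour, hours_kept,
                    prefix_tokens, prefill_price_per_token,
                    reload_cost_per_call=0.0):
    """Number of calls after which caching the prefix costs less than recomputing it."""
    # Cost of keeping the snapshot around for the whole retention window.
    storage_cost = cache_gb * storage_price_per_gb_hour * hours_kept
    # Each call skips the prefill for the prefix but pays to reload the state.
    saved_per_call = prefix_tokens * prefill_price_per_token - reload_cost_per_call
    if saved_per_call <= 0:
        return float("inf")  # reloading costs as much as recomputing: never pays off
    return storage_cost / saved_per_call


# PLACEHOLDER numbers, chosen only to illustrate the calculation.
calls = breakeven_calls(
    cache_gb=1.0,
    storage_price_per_gb_hour=0.001,
    hours_kept=24,
    prefix_tokens=3000,
    prefill_price_per_token=1e-6,
)
print(f"break-even after about {calls:.1f} calls")
```

With these placeholder values the snapshot pays for itself after a handful of calls per day; with real prices the answer could look quite different.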


datadrivenangel 838 days ago

Isn't the expensive part keeping the tokenized input in memory?
