Hacker News
by vermorel 833 days ago
Do any of those LLM-as-a-service companies provide a mechanism to "save" a given input, so that you pay only for the state storage and the extra input when continuing the completion from the snapshot?

Indeed, at 1M tokens and $15/M tokens, we are talking about $10+ per API call when maxing out the LLM capacity.

I see plenty of use cases for such a big context, but re-paying, at every API call, to re-submit the exact same knowledge base seems very inefficient.
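
The arithmetic behind that concern can be sketched in a few lines. The function name and the ten-call scenario are illustrative assumptions; only the 1M-token context and the $15/M price come from the post.

```python
def resubmission_cost(context_tokens: int, price_per_million: float, calls: int) -> float:
    """Total input cost when the full context is re-sent on every call."""
    return context_tokens / 1_000_000 * price_per_million * calls

# Figures from the post: 1M tokens of context at $15 per million input tokens.
per_call = resubmission_cost(1_000_000, 15.0, calls=1)
# Hypothetical session of ten follow-up questions over the same context.
ten_calls = resubmission_cost(1_000_000, 15.0, calls=10)

print(f"per call: ${per_call:.2f}, ten calls: ${ten_calls:.2f}")
```

Without any snapshot mechanism the cost grows linearly with the number of calls, even though the context itself never changes.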

Right now, only ChatGPT (the webapp) seems to be using such snapshots.

Am I missing something?

5 comments

> I see plenty of use cases for such a big context, but re-paying, at every API call, to re-submit the exact same knowledge base seems very inefficient.

If you don't care about latency, or can wait to set up a batch of inputs in one go, there's an alternative method. I call it batch prompting, and pretty much everything we do at work with gpt-4 uses this now. If people are interested I'll do a proper writeup on how to implement it, but the general idea is very straightforward and works reliably. I also think this is a much better evaluation of context than needle in a haystack.

Example for classifying game genres from descriptions.

Default:

[Prompt][Functions][Examples][game description]

->

{"genre": [genre], "sub-genre": [sub-genre]}

Batch Prompting:

[Prompt][Functions][Examples]<game1>[description]</game><game2>[description]</game><game3>[description]</game>...

->

{"game1": {...}, "game2": {...}, "game3": {...}, ...}
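
A minimal sketch of the batch layout above, assuming the model is asked to reply with one JSON object keyed by item tag. The helper names, the sample descriptions, and the hard-coded reply are all hypothetical; no real API call is made, and the closing tags are numbered here for well-formedness, which differs slightly from the shorthand in the comment.

```python
import json

def build_batch_prompt(prefix: str, descriptions: list[str]) -> str:
    """Append each item in its own numbered tag after the shared prefix."""
    items = "".join(
        f"<game{i}>{text}</game{i}>" for i, text in enumerate(descriptions, start=1)
    )
    return prefix + items

def parse_batch_response(raw: str, n_items: int) -> dict:
    """Parse the JSON reply and check that every item got exactly one entry."""
    data = json.loads(raw)
    expected = {f"game{i}" for i in range(1, n_items + 1)}
    if set(data) != expected:
        raise ValueError(f"expected keys {sorted(expected)}, got {sorted(data)}")
    return data

# The shared prefix is paid for once per batch instead of once per item.
prefix = "[Prompt][Functions][Examples]"
prompt = build_batch_prompt(prefix, ["A farming sim.", "A space shooter."])

# Stand-in for a model reply; a real reply would come from the LLM call.
fake_reply = '{"game1": {"genre": "simulation"}, "game2": {"genre": "shooter"}}'
results = parse_batch_response(fake_reply, n_items=2)
print(results["game2"]["genre"])
```

The key check in `parse_batch_response` only catches missing or extra entries. It would not detect content from one item leaking into another.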

I attempted similar mechanics multiple times in the past, but always ditched them, as there was always a non-negligible amount of cross-contamination happening between the individual instances you are batching. That caused so much of a headache that it wasn't really worth it.
Yeah that's definitely a risk with language models but it doesn't seem to be too bad for my use cases. Can I ask what tasks you used it for?

I don't really intend for this method to be final. I'll switch everything over to finetunes at some point. But this works way better than I would have expected so I kept using it.

One thing I tried using it for was a summarization/reformulation task, where it did RAG over ~3-4 smallish (~single-sentence) documents per instance, and each instance should in the end form a coherent sentence. There, batching either caused one of the facts to slip into an adjacent instance or two instances to be merged into one.

Another thing I used it for was data extraction, where I extracted units of measurement and other key attributes out of descriptions from classifieds listings (my SO and I were looking for a cheap used couch). Non-batched it performed very well, while in batched mode it either mixed dimensions of multiple listings or, after the summary for the initial listing, just gave nulls for all following listings.

Agreed, same problem here.
Yes: that's essentially what their fine-tuning offerings are. They rewrite some weights in the top layers based on your input, and save+serve that for you.

It sounds like you would like a wrapped version tuned just for big context.

(As others write, RAG versions are also being supported, but they're less fundamentally similar. RAG is about preprocessing to cut the input down to relevant bits. RAG + an agent framework does get closer again tho by putting this into a reasoning loop.)

Fine-tuning is not great for the use case of long documents. RAG is closer.
FWIW the use case you're describing is very often achievable with RAG. Embedding models are deterministic, so while you're still limited by the often-nondeterministic nature of the LLM, in practice you can usually get the same answer for the same input. And it's substantially cheaper to do.
With 1M tokens, if snapshotting the LLM state is cheap, it would, out of the box, beat nearly all RAG setups, except the ones dealing with large datasets. 1M tokens is a lot of docs.
Yeah, but latency is still a factor here. Any follow-up question requires re-scanning the whole context, which often takes a long time. IIRC when Google showed their demos for this use case each request took over 1 minute for ~650k tokens.
How would that work technically, from a cost of goods sold perspective? (honestly asking, curious)
The "cost" is storing the state of the LLM after processing the input. My back-of-the-envelope guesstimate gives me 1GB to capture the 8-bit state of a 70B-parameter model (I might be wrong though, insights are welcome), which is quite manageable with NVMe storage for fast reload. The operator would charge per "saved" prompt, plus maybe a fixed per-call fee to reload the state.
My calculation of the KV cache gives 1GB per 3000 tokens for fp16. I am surprised OpenAI's competitors haven't done this. This kind of feature has not-so-niche uses, where prefix data could be cached.
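
The 1GB-per-3000-tokens figure can be reproduced under one plausible set of assumptions. The architecture numbers below (80 layers, 8 key/value heads via grouped-query attention, head dimension 128) match a Llama-2-70B-style model; they are not stated anywhere in the thread, so treat them as an assumption rather than what the commenter actually used.

```python
def kv_cache_bytes(tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int) -> int:
    """Size of the cached keys and values for a given number of tokens."""
    # Factor 2: one key vector and one value vector per layer, per KV head, per token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * tokens

# Assumed Llama-2-70B-style shape, fp16 values (2 bytes each).
ASSUMED_SHAPE = dict(n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_value=2)

for tokens in (3_000, 1_000_000):
    size = kv_cache_bytes(tokens, **ASSUMED_SHAPE)
    print(f"{tokens:>9} tokens: {size / 1e9:.2f} GB")
```

Under these assumptions the cache is about 0.98 GB at 3000 tokens and about 328 GB at 1M tokens, so the per-snapshot storage for a maxed-out context would be far above 1GB.
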
That's a great idea! It would also open up the possibility for very long 'system prompts' on the side of the company, so they could better fine-tune their guardrails
I think the answer's in the original question: the provider has to pay for extra storage to cache the model state at the prompt you're asking to snapshot. But it's not necessarily a net increase in costs for the provider, because in exchange for doing so they (as well as you) are getting to avoid many expensive inference rounds.
Isn't the expensive part keeping the tokenized input in memory?
The problem is that it's probably often not a lot cheaper. Most of the high-end GPUs have comparatively little bandwidth over PCIe (which you'd need to use to store the context on an NVMe drive, for example). The cost there would scale with length too, so you wouldn't necessarily save more in that situation either. I think if you used a small enough GQA ratio and you knew for sure you would reuse the weights it could work, but my suspicion is that in general it would just be cheaper to recalculate.
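
One way to frame the reload-versus-recompute question is to put both on the same axis. This is only a rough sketch: the bytes-per-token value assumes the GQA layout discussed earlier in the thread, the link bandwidth and GPU throughput are placeholder numbers rather than measurements, and the compute estimate uses the common approximation of 2 FLOPs per parameter per token while ignoring attention's quadratic term.

```python
def reload_seconds(cache_bytes: float, link_bytes_per_s: float) -> float:
    """Time to move a stored KV cache back to the GPU over a link."""
    return cache_bytes / link_bytes_per_s

def recompute_seconds(tokens: int, n_params: float, flops_per_s: float) -> float:
    """Rough prefill time: ~2 FLOPs per parameter per token, attention ignored."""
    return 2 * n_params * tokens / flops_per_s

# All values below are assumptions chosen for illustration only.
TOKENS = 100_000
BYTES_PER_TOKEN = 327_680   # fp16 cache, 80 layers, 8 KV heads, head dim 128
LINK_BYTES_PER_S = 25e9     # placeholder for an effective PCIe/NVMe path
N_PARAMS = 70e9             # 70B-parameter model
FLOPS_PER_S = 4e14          # placeholder for sustained GPU throughput

t_reload = reload_seconds(TOKENS * BYTES_PER_TOKEN, LINK_BYTES_PER_S)
t_recompute = recompute_seconds(TOKENS, N_PARAMS, FLOPS_PER_S)
print(f"reload: {t_reload:.2f} s, recompute: {t_recompute:.2f} s")
```

Both estimates scale linearly with context length, which matches the point that a longer prompt does not by itself tip the balance. Which side wins depends entirely on the bandwidth, throughput, and KV-head count plugged in, so the placeholder values above should not be read as a verdict.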