| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by vermorel 880 days ago

Does any of those LLM-as-a-service companies provide a mechanism to "save" a given input? Paying only for the state storage and the extra input when continuing the completion from the snapshot?

Indeed, at 1M token and $15/M tokens, we are talking of $10+ API calls (per call) when maxing out the LLM capacity.

I see plenty of use cases for such a big context, but re-paying, at every API call, to re-submit the exact same knowledge base seems very inefficient.

Right now, only ChatGPT (the webapp) seems to be using such those snapshots.

Am I missing something?

5 comments

msp26 879 days ago

> I see plenty of use cases for such a big context, but re-paying, at every API call, to re-submit the exact same knowledge base seems very inefficient.

If you don't care about latency or can wait to set up a batch of inputs in one go there's an alternative method. I call it batch prompting and pretty much everything we do at work with gpt-4 uses this now. If people are interested I'll do a proper writeup on how to implement it but the general idea is very straightforward and works reliably. I also think this is a much better evaluation of context than needle in a haystack.

Example for classifying game genres from descriptions.

Default:

[Prompt][Functions][Examples][game description]

- >

{"genre": [genre], "sub-genre": [sub-genre]}

Batch Prompting:

[Prompt][Functions][Examples]<game1>[description]</game><game2>[description]</game><game3>[description]</game>...

- >

{"game1": {...}, "game2": {...}, "game3": {...}, ...}

hobofan 879 days ago

I attempted similar mechanics multiple times in the past, but always ditched them, as there was always a non-negligable amount of cross-contamination happening between the individual instances you are batching. That caused so much of a headache that it wasn't really worth it.

msp26 879 days ago

Yeah that's definitely a risk with language models but it doesn't seem to be too bad for my use cases. Can I ask what tasks you used it for?

I don't really intend for this method to be final. I'll switch everything over to finetunes at some point. But this works way better than I would have expected so I kept using it.

hobofan 879 days ago

One thing I tried using it for was for a summarization/reformulation tasks, where it did RAG of ~3-4 smallish (~single sentence) documents per instance where each should be in the end form a coherent sentence. There, batching either caused one of the facts to slip into an adjacent instance or two instances to be merged into one.

Another thing I used it for was data extraction, where I extracted units of measurements and other key attributes out of descriptions from classifieds listings (my SO and me were looking for a cheap used couch). Non-batched it performed very well, while in the batched mode, it either mixed dimensions of multiple listings or after the summary for the initial listing it just gave nulls for all following listings.

vermorel 879 days ago

Agreed, some problem here.

lmeyerov 879 days ago

Yes: That's essentially their fine-tuning offerings. They rewrite some weights in the top layers based on your input, and save+serve that for you.

It sounds like you would like a wrapped version tuned just for big context.

(As others write, RAG versions are also being supported, but they're less fundamentally similar. RAG is about preprocessing to cut the input down to relevant bits. RAG + an agent framework does get closer again tho by putting this into a reasoning loop.)

brokensegue 879 days ago

Fine tuning is not great for the use case of long documents. RAG is closer

phillipcarter 880 days ago

FWIW the use case you're describing is very often achievable with RAG. Embedding models are deterministic, so while you're still limited by the often-nondeterministic nature of the LLM, in practice you can usually get the same answer for the same input. And it's substantially cheaper to do.

vermorel 879 days ago

With 1M tokens, if snapshotting the LLM state is cheap, it would beat out-of-the-box nearly all RAG setups, except the ones dealing with large datasets. 1M tokens is a lot of docs.

phillipcarter 879 days ago

Yeah, but latency is still a factor here. Any follow-up question requires re-scanning the whole context, which often takes a long time. IIRC when Google showed their demos for this use case each request took over 1 minute for ~650k tokens.

ethbr1 880 days ago

How would that work technically, from a cost of goods sold perspective? (honestly asking, curious)

vermorel 880 days ago

The "cost" is storing the state of the LLM after processing the input. My back-of-the-envelop guesstimate gives me 1GB to capture the 8bit state of 70B parameters model (I might be wrong though, insights are welcome), which is quite manageable with NVMe storage for fast reload. The operator would charge per pay per "saved" prompt, plus maybe a fix per call fee to re-load the state.

YetAnotherNick 879 days ago

My calculation of kv cache gives 1GB per 3000 tokens for fp16. I am surprised openAI competitors haven't done this. This kind of features have not so niche uses, where prefix data could be cached.

FergusArgyll 879 days ago

That's a great idea! It would also open up the possibility for very long 'system prompts' on the side of the company, so they could better fine-tune their guardrails

cjbprime 880 days ago

I think the answer's in the original question: the provider has to pay for extra storage to cache the model state at the prompt you're asking to snapshot. But it's not necessarily a net increase in costs for the provider, because in exchange for doing so they (as well as you) are getting to avoid many expensive inference rounds.

datadrivenangel 880 days ago

Isn't the expensive part keeping the tokenized input in memory?

chessgecko 879 days ago

The problem is that it’s probably often not a lot cheaper. Most of the high end gpus have comparatively little bandwidth over pcie (that you’d need to use to store the context on a nvme for example). The cost there would scale with length too so you wouldn’t necessarily save more in that situation either. I think if you used a small enough gqa ratio and you knew for sure you would reuse the weights it could work, but my suspicion is that in general it would just be cheaper to recalculate.