Hacker News
by burtonator 672 days ago
Autoregressive models can't just resume from an earlier request, so they have to re-process the entire prompt on every execution.

With the prompt cached, they resume from where they left off, completely bypassing all that computation.

For large contexts this could save a ton of compute!
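
To make the idea concrete, here is a toy sketch in Python. It is not how any provider implements this: `ToyModel`, `run_prompt`, and the integer "state" are hypothetical stand-ins for a real model and its per-token key/value tensors. The only point is that a prompt sharing a prefix with an earlier one pays only for its new tokens.

```python
from typing import Dict, List, Tuple

Prefix = Tuple[str, ...]


class ToyModel:
    """Hypothetical stand-in for an LLM: folds tokens into an integer state."""

    def __init__(self) -> None:
        self.steps = 0  # tokens processed so far, a rough proxy for compute

    def step(self, state: int, token: str) -> int:
        self.steps += 1
        return (state * 31 + sum(ord(c) for c in token)) % 1_000_003


def run_prompt(model: ToyModel, tokens: List[str], cache: Dict[Prefix, int]) -> int:
    """Process a prompt, resuming from the longest cached prefix if one exists."""
    start, state = 0, 0
    for n in range(len(tokens), 0, -1):
        hit = cache.get(tuple(tokens[:n]))
        if hit is not None:
            start, state = n, hit
            break
    for i in range(start, len(tokens)):
        state = model.step(state, tokens[i])
        # Toy only: remember the state after every prefix.
        cache[tuple(tokens[: i + 1])] = state
    return state


system = "You are a helpful assistant .".split()
first = system + "What is 2 + 2 ?".split()
second = system + "Name a primary color .".split()

model, cache = ToyModel(), {}
run_prompt(model, first, cache)
steps_after_first = model.steps
run_prompt(model, second, cache)
print(steps_after_first, model.steps - steps_after_first)
```

The first prompt costs one step per token. The second shares the six-token system prefix, so only its five new tokens are processed. The longer the shared prefix, the bigger the saving.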

I think this feature and structured outputs are some of the biggest inventions in LLMs this year.


Prompt caching has been a thing for LLMs since GPT-2 (e.g. transformers' `use_cache=True`); the bigger surprise is that it took this long for the main LLM providers to offer a good implementation.
I’m building an app with OpenAI, using structured outputs. Does OpenAI also support prompt caching?
I'm sure internally they use it for the system prompt at least, probably since launch. And maybe for common initial user queries that exactly match.
They are certainly not passing the savings on to the users.
Yet. I suspect OpenAI will release a similar offering soon. (hooray, free market competition!)
That $100 billion data center has to get paid for somehow.
Not currently.