by burtonator
672 days ago
Autoregressive models can't just resume, so they have to reprocess the entire prompt on every request. With the prompt cached, the model resumes from where it left off, completely bypassing all that computation. For large contexts this could save a ton of compute! I think this feature and structured outputs are some of the biggest inventions in LLMs this year.
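To make the saving concrete, here is a rough toy sketch, not how any real provider implements it. `ToyModel`, `encode`, and `run` are hypothetical names, and the "state" is a single integer standing in for what real systems cache (the attention key/value tensors for the prompt tokens). It only counts how many tokens get processed with and without reusing a saved prefix.

```python
from typing import Dict, List, Tuple


class ToyModel:
    """Hypothetical stand-in for an autoregressive model.

    The state after token i depends on every earlier token, so a prompt
    cannot be skipped over unless the state at that point was saved.
    """

    def __init__(self) -> None:
        self.tokens_processed = 0
        self.cache: Dict[Tuple[str, ...], int] = {}

    def _step(self, state: int, token: str) -> int:
        # One unit of "compute" per token; the arithmetic itself is arbitrary.
        self.tokens_processed += 1
        return (state * 31 + sum(map(ord, token))) % 1_000_003

    def encode(self, tokens: List[str], use_cache: bool) -> int:
        state, start = 0, 0
        if use_cache:
            # Resume from the longest previously seen prompt that is a prefix.
            for end in range(len(tokens), 0, -1):
                saved = self.cache.get(tuple(tokens[:end]))
                if saved is not None:
                    state, start = saved, end
                    break
        # Only the tokens after the cached prefix cost anything.
        for token in tokens[start:]:
            state = self._step(state, token)
        if use_cache:
            self.cache[tuple(tokens)] = state
        return state


def run(use_cache: bool) -> Tuple[int, List[int]]:
    """Send a shared 1000-token document, then two follow-up questions."""
    model = ToyModel()
    doc = ["tok"] * 1000
    prompts = [doc, doc + ["what", "is", "x"], doc + ["and", "y"]]
    states = [model.encode(p, use_cache) for p in prompts]
    return model.tokens_processed, states


uncached_cost, uncached_states = run(use_cache=False)
cached_cost, cached_states = run(use_cache=True)
print(uncached_cost, cached_cost)  # 3005 1005
```

The final states are identical either way; the cached run just skips the 1000 shared tokens on the second and third requests. This toy only hits the cache when an earlier prompt is an exact prefix of the new one, which is a simplification I chose for brevity.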