by sshumaker 759 days ago

They are almost certainly doing this internally for their own chat products.

The simple version of this just involves saving the KV cache from the attention layers and restoring it later instead of recomputing it. It requires only small changes to the inference code and the attention layers.
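To make the save/restore idea concrete, here is a minimal sketch in plain Python. It is a toy single-head attention layer, not any vendor's implementation: `project`, `ToyAttentionLayer`, `save_cache`, and `restore_cache` are hypothetical names, and the sine-based projection is only a stand-in for learned weights.

```python
import copy
import math


def project(token_id, seed, dim=4):
    # Hypothetical stand-in for a learned Q, K, or V projection of one token.
    return [math.sin(token_id * (i + 1) + seed) for i in range(dim)]


def attend(query, keys, values):
    # Scaled dot-product attention for a single query over the cached K/V.
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


class ToyAttentionLayer:
    def __init__(self):
        self.keys, self.values = [], []
        self.tokens_processed = 0  # how many tokens had K/V computed here

    def step(self, token_id):
        # Append this token's K/V to the cache, then attend over everything.
        self.keys.append(project(token_id, seed=1))
        self.values.append(project(token_id, seed=2))
        self.tokens_processed += 1
        return attend(project(token_id, seed=3), self.keys, self.values)

    def save_cache(self):
        # Snapshot the cache so it can be stored outside the layer.
        return copy.deepcopy((self.keys, self.values))

    def restore_cache(self, snapshot):
        # Load a snapshot instead of re-running the prefix.
        self.keys, self.values = copy.deepcopy(snapshot)


prompt = [5, 9, 2, 7]  # shared prefix, e.g. earlier chat turns
next_token = 3

# Baseline: process the whole prefix, snapshot, then take one more step.
full = ToyAttentionLayer()
for token in prompt:
    full.step(token)
snapshot = full.save_cache()
expected = full.step(next_token)

# Resumed: restore the snapshot and process only the new token.
resumed = ToyAttentionLayer()
resumed.restore_cache(snapshot)
actual = resumed.step(next_token)
```

The resumed layer produces the same attention output while computing K/V for one token instead of five. In a real model the snapshot would hold one K and one V tensor per layer.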

The main challenge is doing this at scale, e.g. dumping the cached keys and values out of GPU memory, persisting them, and having a system that can rapidly reload them as needed (or just regenerate them).
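A rough sketch of that evict, persist, and reload-or-regenerate flow, under heavy assumptions: a small in-memory LRU stands in for GPU memory, local files stand in for the persistent tier, and `TieredKVStore` and `fake_prefill` are hypothetical names rather than parts of any real serving stack.

```python
import os
import pickle
import tempfile
from collections import OrderedDict


class TieredKVStore:
    """Toy two-tier cache: a small LRU (the 'GPU' tier) backed by disk."""

    def __init__(self, directory, hot_capacity=2):
        self.directory = directory
        self.hot_capacity = hot_capacity
        self.hot = OrderedDict()  # session_id -> cached K/V, in LRU order

    def _path(self, session_id):
        return os.path.join(self.directory, f"{session_id}.kv")

    def put(self, session_id, kv):
        self.hot[session_id] = kv
        self.hot.move_to_end(session_id)
        # Spill least recently used sessions to disk when over capacity.
        while len(self.hot) > self.hot_capacity:
            evicted_id, evicted_kv = self.hot.popitem(last=False)
            with open(self._path(evicted_id), "wb") as handle:
                pickle.dump(evicted_kv, handle)

    def get(self, session_id, regenerate):
        if session_id in self.hot:
            self.hot.move_to_end(session_id)
            return self.hot[session_id], "hot"
        path = self._path(session_id)
        if os.path.exists(path):
            with open(path, "rb") as handle:
                kv = pickle.load(handle)
            source = "disk"
        else:
            # Nothing persisted: fall back to recomputing the cache.
            kv = regenerate(session_id)
            source = "regenerated"
        self.put(session_id, kv)
        return kv, source


def fake_prefill(session_id):
    # Placeholder for re-running the model over the session's prompt.
    return {"keys": [session_id] * 3, "values": [session_id] * 3}


store = TieredKVStore(tempfile.mkdtemp(), hot_capacity=2)
for sid in ("a", "b", "c"):
    store.put(sid, fake_prefill(sid))  # inserting "c" spills "a" to disk

sources = [store.get(sid, fake_prefill)[1] for sid in ("c", "a", "z")]
```

Here `sources` records where each lookup was served from: the hot tier, disk, or a fresh recomputation. A production system would also have to weigh reload latency against the cost of simply regenerating, which this sketch ignores.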

1 comment

ethbr1 759 days ago

2024 is the year of the serverless LLM?
