by sshumaker 759 days ago
They are almost certainly doing this internally for their own chat products.

The simple version of this just involves saving off the KV cache from the attention layers and restoring it later instead of recomputing it. It only requires small changes to the inference code and the attention layers.
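To make the save/restore idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: `ToyAttention`, `project`, `save_cache`, and `load_cache` are hypothetical names, the "projections" are fixed scalings rather than learned weights, and a real system would snapshot GPU tensors per layer and per head. It only shows that restoring a saved cache gives the same next-token output as re-running the prefix.

```python
import copy
import math


def project(x, scale):
    # Stand-in for a learned linear projection (hypothetical fixed weights).
    return [scale * v for v in x]


class ToyAttention:
    """Single-head attention that appends each token's K/V to a cache."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, x):
        # Compute Q/K/V for the new token and extend the cache.
        q = project(x, 0.5)
        self.keys.append(project(x, 1.5))
        self.values.append(project(x, 2.0))
        # Scaled dot-product scores against every cached key.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(len(x))
            for k in self.keys
        ]
        # Numerically stable softmax over the scores.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        # Weighted sum of the cached values.
        return [
            sum(w * v[i] for w, v in zip(weights, self.values)) / total
            for i in range(len(x))
        ]

    def save_cache(self):
        # Snapshot the cache so later steps cannot mutate the saved copy.
        return copy.deepcopy((self.keys, self.values))

    def load_cache(self, snapshot):
        # Restore a snapshot instead of re-running the prefix.
        self.keys, self.values = copy.deepcopy(snapshot)


prefix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_token = [0.5, -0.5]

# Process the prefix once and save the resulting cache.
warm = ToyAttention()
for tok in prefix:
    warm.step(tok)
snapshot = warm.save_cache()

# Path A: restore the snapshot, then process only the new token.
restored = ToyAttention()
restored.load_cache(snapshot)
out_restored = restored.step(new_token)

# Path B: recompute the whole prefix from scratch.
recomputed = ToyAttention()
for tok in prefix:
    recomputed.step(tok)
out_recomputed = recomputed.step(new_token)
```

Both paths end with the same cache contents and the same output; the restored path just skips the prefix computation.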

The main challenge is doing this at scale, e.g. dumping the KV cache out of GPU memory, persisting it, and having a system to rapidly reload it as needed (or just regenerate it).
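A rough sketch of what that reload-or-regenerate path could look like, again in plain Python. This is not how any particular provider does it: `PrefixCacheStore`, `get_or_regenerate`, and `fake_prefill` are hypothetical names, the local directory stands in for whatever persistent store is used, the small in-memory dict stands in for GPU memory, and `pickle` is only acceptable here because the data is produced by the same process.

```python
import hashlib
import pickle
import tempfile
from collections import OrderedDict
from pathlib import Path


class PrefixCacheStore:
    """Two-tier store: a small LRU in memory, everything persisted on disk."""

    def __init__(self, directory, max_in_memory=2):
        self.directory = Path(directory)
        self.max_in_memory = max_in_memory
        self.hot = OrderedDict()  # stand-in for caches resident in GPU memory

    @staticmethod
    def key_for(prefix_tokens):
        # Identify a cache by a hash of the exact token prefix.
        return hashlib.sha256(repr(list(prefix_tokens)).encode()).hexdigest()

    def put(self, prefix_tokens, kv_cache):
        key = self.key_for(prefix_tokens)
        (self.directory / key).write_bytes(pickle.dumps(kv_cache))
        self._remember(key, kv_cache)

    def get_or_regenerate(self, prefix_tokens, regenerate):
        key = self.key_for(prefix_tokens)
        if key in self.hot:
            self.hot.move_to_end(key)  # mark as most recently used
            return self.hot[key], "memory"
        path = self.directory / key
        if path.exists():
            kv_cache = pickle.loads(path.read_bytes())
            self._remember(key, kv_cache)
            return kv_cache, "disk"
        # Nothing saved: fall back to recomputing the prefix.
        kv_cache = regenerate(prefix_tokens)
        self.put(prefix_tokens, kv_cache)
        return kv_cache, "regenerated"

    def _remember(self, key, kv_cache):
        self.hot[key] = kv_cache
        self.hot.move_to_end(key)
        while len(self.hot) > self.max_in_memory:
            self.hot.popitem(last=False)  # evict the least recently used


def fake_prefill(tokens):
    # Hypothetical stand-in for running the model over the prefix.
    return {"keys": [t * 2 for t in tokens], "values": [t * 3 for t in tokens]}


store = PrefixCacheStore(tempfile.mkdtemp(), max_in_memory=1)
_, first = store.get_or_regenerate([1, 2, 3], fake_prefill)
_, second = store.get_or_regenerate([1, 2, 3], fake_prefill)
store.get_or_regenerate([4, 5], fake_prefill)  # evicts [1, 2, 3] from memory
cache, third = store.get_or_regenerate([1, 2, 3], fake_prefill)
```

Here the first lookup regenerates, the second is served from memory, and the last one comes back from disk after the eviction. In practice the hard part is whether the reload is actually faster than just regenerating, which depends on cache size and storage bandwidth.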


Is 2024 the year of the serverless LLM?