Hacker News new | ask | show | jobs
by sigmoid10 62 days ago
That's not how it works. The system prompt doesn't "get calculated first" or anything. You combine it with the user prompt and then run the generation for the first new token on that thing, which basically boils down to one huge matmul that runs in parallel. So you can literally just cache a part of the input matrices for the first step and then you'll very quickly run into n^2 complexity.
1 comments

The system prompt will always match in the prefix cache. I just meant it could be prefilled before any user queries on completely different hardware. Then you are only dealing with the n^2 only for the actual user prompt. We're in agreeance I think.