Hacker News
by alew1 1123 days ago
But the model ultimately still has to process the comma, the newline, the "job". Is the main time savings that this can be done in parallel (on a GPU), whereas in typical generation it would be sequential?

Yes. If you look at the biggest models on the OpenAI and Anthropic APIs, prompt tokens are significantly cheaper than response tokens.
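
To make the parallel-vs-sequential point concrete, here is a rough sketch in Python. It only counts dependent forward passes under a simplified cost model: every token the model has to emit itself costs one pass, while tokens that are already known (the comma, the newline, the "job") can be fed in as input alongside the next pass. The segment list and the function name are made up for illustration, and real per-pass cost also depends on how many tokens are in the pass, which this ignores.

```python
from typing import List, Tuple

# A segment is ("fixed", n) for n known template tokens,
# or ("gen", n) for n tokens the model must choose itself.
Segment = Tuple[str, int]


def sequential_passes(segments: List[Segment], fixed_is_prefilled: bool) -> int:
    """Count dependent forward passes under a simplified cost model.

    Assumption: each token the model emits needs one forward pass that
    depends on all earlier tokens. Known tokens need no pass of their own
    when they are supplied as input (prefilled) rather than generated.
    """
    passes = 0
    for kind, n_tokens in segments:
        if kind == "gen" or not fixed_is_prefilled:
            passes += n_tokens  # one pass per emitted token
        # Otherwise the fixed tokens ride along as input to a later pass.
    return passes


# Hypothetical template: 4 fixed tokens, 3 generated, 5 fixed, 2 generated.
template = [("fixed", 4), ("gen", 3), ("fixed", 5), ("gen", 2)]

all_generated = sequential_passes(template, fixed_is_prefilled=False)
template_prefilled = sequential_passes(template, fixed_is_prefilled=True)

print(all_generated)       # 14
print(template_prefilled)  # 5
```

The model still processes all 14 tokens either way. What changes is that the 9 known tokens no longer sit on the sequential critical path, which fits with prompt tokens being priced lower than response tokens.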