|
|
|
|
|
by dragonwriter
375 days ago
|
|
My understanding is that the implementation of modern hosted LLMs is nondeterministic even with known seed because the generated results are sensitive to a number of other factors including, but not limited to, other prompts running in the same batch. |
|
> Now, when you send a request to one of the Gemini 2.5 models, if the request shares a common prefix as one of previous requests, then it’s eligible for a cache hit. We will dynamically pass cost savings back to you, providing the same 75% token discount.
> In order to increase the chance that your request contains a cache hit, you should keep the content at the beginning of the request the same and add things like a user's question or other additional context that might change from request to request at the end of the prompt.
From https://news.ycombinator.com/item?id=43939774 re: same:
> Does this make it appear that the LLM's responses converge on one answer when actually it's just caching?