Even in the simplest of applications where all you’re doing is passing “last user query” + “retrieved articles” into openAI (and nothing else that is different between users, like previous queries or user data that may be necessary to answer), this will be a bad experience in many cases.
Queries A and B may have similar embeddings (similar topic) and it may be correct to retrieve the same articles for context (which you could cache), but they can still be different questions with different correct answers.
Depends on the scenario. In a threaded query, or multiple queries from the same user - you’d want different outputs. If 20 different users are looking for the same result - a cache would return the right answer immediately for no marginal cost.
Even in the simplest of applications where all you’re doing is passing “last user query” + “retrieved articles” into openAI (and nothing else that is different between users, like previous queries or user data that may be necessary to answer), this will be a bad experience in many cases.
Queries A and B may have similar embeddings (similar topic) and it may be correct to retrieve the same articles for context (which you could cache), but they can still be different questions with different correct answers.