by matt_lee
884 days ago
Thanks! Max's cofounder chiming in here.

1) There's an interesting subtlety in the phrase "the correct question-answer pairs". We don't often find factually incorrect pairs because of how we run the pipeline; the bigger question is whether the pairs we generate are "the" correct ones, that is, whether they are relevant and helpful. This takes some manual tweaking at the moment. Inconsistent outputs across runs are definitely an issue, but most teams we've worked with barely have the CI/CD practice needed to measure that rigorously. As we mature we'll aim to tackle flakiness of tests (and models), but the bigger challenge so far has been getting regular tests like these set up in the first place.

2) In this scenario, we go to the documents powering a RAG application to both generate and grade answers. For example, the knowledge base might know that (1) product A is being recalled, and (2) customer #4 is asking for a warranty claim on product A. Using those two bits of information, we might generate a scenario that tests whether or not customer #4 gets the claim fulfilled. In other words, specific user information is simulated/used during test set creation.
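
A rough Python sketch of the recall example, for illustration only. This is not the real pipeline: `Fact`, `Scenario`, `generate_scenarios`, and `grade` are hypothetical names, the facts are hard-coded rather than extracted from documents, and treating a recall as grounds for fulfilling the claim is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Fact:
    """One piece of knowledge-base information (hypothetical schema)."""
    kind: str                       # "recall" or "warranty_claim"
    product: str
    customer: Optional[str] = None  # only set for warranty claims


@dataclass(frozen=True)
class Scenario:
    """A generated test case plus the facts it was derived from."""
    question: str
    expected: str
    sources: Tuple[Fact, ...]


def generate_scenarios(facts: List[Fact]) -> List[Scenario]:
    """Pair each warranty claim with a recall on the same product."""
    recalled = {f.product: f for f in facts if f.kind == "recall"}
    scenarios = []
    for f in facts:
        if f.kind == "warranty_claim" and f.product in recalled:
            scenarios.append(Scenario(
                question=(f"Customer #{f.customer} files a warranty claim "
                          f"on product {f.product}. What is the outcome?"),
                # Assumption: a recalled product means the claim is fulfilled.
                expected="fulfilled",
                sources=(recalled[f.product], f),
            ))
    return scenarios


def grade(scenario: Scenario, answer: str) -> bool:
    """Exact-match grading; a stand-in for document-based grading."""
    return answer.strip().lower() == scenario.expected


facts = [
    Fact("recall", "A"),
    Fact("warranty_claim", "A", customer="4"),
    Fact("warranty_claim", "B", customer="7"),  # no recall, so no scenario
]
for s in generate_scenarios(facts):
    print(s.question)
```

Keeping the source facts on each scenario means a failing test can be traced back to the documents that produced it.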