by matt_lee 884 days ago
Thanks! Max's cofounder chiming in here.

1) There's an interesting subtlety in the phrase "the correct question-answer pairs". While we don't often find factually incorrect pairs because of how we're running the pipeline, the bigger question is whether the pairs we generated are "the" correct ones, that is, whether they are relevant and helpful. This takes some manual tweaking at the moment.

Inconsistent outputs across runs are definitely an issue, but most teams we've worked with barely have the CI/CD practices in place to measure that rigorously. As we mature, we'll aim to tackle flakiness of tests (and models), but the bigger challenge so far has been getting regular tests like these set up in the first place.
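
For a rough idea of what measuring run-to-run inconsistency could look like, here is a minimal sketch. It assumes a hypothetical `ask` callable that wraps whatever model or RAG app is under test, and it uses exact string matching after normalization as a crude stand-in for a real grader. None of these names come from the product itself.

```python
from collections import Counter
from typing import Callable, List, Sequence


def consistency_rate(ask: Callable[[str], str], question: str, runs: int = 5) -> float:
    """Fraction of runs whose answer matches the most common answer.

    `ask` is a hypothetical wrapper around the system under test.
    Answers are compared after trimming whitespace and lowercasing,
    which is only a rough proxy for semantic agreement.
    """
    answers = [ask(question).strip().lower() for _ in range(runs)]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / runs


def flaky_questions(
    ask: Callable[[str], str],
    questions: Sequence[str],
    runs: int = 5,
    threshold: float = 0.8,
) -> List[str]:
    """Return the questions whose consistency rate falls below `threshold`."""
    return [q for q in questions if consistency_rate(ask, q, runs) < threshold]
```

Something like `flaky_questions(ask, test_set)` could then run as a CI step and fail the build when the list is non-empty. The threshold and run count here are arbitrary placeholders.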

2) In this scenario, we go to the documents powering a RAG application to both generate and grade answers. For example, the knowledge base might know that (1) product A is being recalled, and (2) customer #4 is filing a warranty claim on product A. Using those two bits of information, we might generate a scenario that tests whether or not customer #4 gets the claim fulfilled. In other words, specific user information is simulated and used during test set creation.
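
To make the product A example concrete, here is a small sketch of combining two knowledge-base facts into one test scenario. The `Fact` and `Scenario` types, the `kind` labels, and `build_scenarios` are all hypothetical names for illustration, not the actual pipeline. The sketch also leaves the expected outcome open, since that depends on the application's own policy.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Fact:
    kind: str                       # "recall" or "warranty_claim" (hypothetical labels)
    product: str
    customer: Optional[str] = None  # only set for customer-specific facts


@dataclass(frozen=True)
class Scenario:
    question: str
    grounding: Tuple[Fact, ...]     # facts a grader can check the answer against


def build_scenarios(facts: Sequence[Fact]) -> List[Scenario]:
    """Pair each warranty claim with a recall on the same product."""
    recalled = {f.product: f for f in facts if f.kind == "recall"}
    scenarios = []
    for f in facts:
        if f.kind == "warranty_claim" and f.product in recalled:
            scenarios.append(
                Scenario(
                    question=(
                        f"Customer {f.customer} filed a warranty claim on "
                        f"{f.product}. Should the claim be fulfilled?"
                    ),
                    grounding=(recalled[f.product], f),
                )
            )
    return scenarios
```

Keeping the grounding facts attached to each scenario means the same documents that produced the question can also be used to grade the answer.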

1 comment

I think this could be more useful to most people as a prompt/RAG testing service rather than an LLM testing service. If I ran a test and found out the LLM I was using is 60% accurate on some topic, what would I do with this knowledge? Build a more accurate LLM? Switch to another? On the other hand, if a service offered me suggestions to improve accuracy by providing a score for various prompt or RAG inputs, I think this would be very useful to many people. It could even uncover a general prompting strategy, depending on the underlying LLM or the inputs available, which would be really useful.