by pchunduri6 888 days ago
I just tried the demo, and it looks great! Congrats on the launch!

I have a couple of questions:

1) How often do you find that the LLM fails to generate the correct question-answer pairs? The biggest challenge I'm facing with LLM-based evaluation is the variability in LLM performance. I've found that the same prompt results in different LLM responses over multiple runs. Do you have any insights on this issue and how to address it?

2) Sometimes, the domain expert generating the test set might not be well-equipped to grade the answers. Consider a customer-facing chatbot application. The RAG app might be focused on very specific user information that is hard for the test set creator to verify or attest to. Do you think there are ways to make this grading process easier?


Thanks! Max's cofounder chiming in here.

1) There's an interesting subtlety in the phrase "the correct question-answer pairs". While we don't often find factually incorrect pairs because of how we're running the pipeline, the bigger question is whether the pairs we generate are "the" correct ones -- that is, whether they are relevant and helpful. This takes some manual tweaking at the moment.

Inconsistent outputs over different runs are definitely an issue, but most teams we've worked with barely even have the CI/CD practice to be able to measure that rigorously. As we mature we'll aim to tackle flakiness of tests (and models) over time, but a bigger challenge has been getting regular tests like these set up in the first place.
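To make "measuring that rigorously" concrete, here is a minimal sketch of one possible consistency check: run the same prompt several times and report the share of runs that agree with the most common answer. This is only an illustration, not the pipeline described above. `consistency`, `make_fake_model`, and the canned replies are hypothetical names and data; a real setup would replace the fake model with an actual LLM call and would likely need a looser notion of "same answer" than exact string match.

```python
from collections import Counter
from typing import Callable, List


def consistency(run_model: Callable[[str], str], prompt: str, runs: int = 5) -> float:
    """Fraction of runs that agree with the most common normalized answer."""
    answers = [run_model(prompt).strip().lower() for _ in range(runs)]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / runs


def make_fake_model(replies: List[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for an LLM call: cycles through canned replies."""
    state = {"i": 0}

    def fake(prompt: str) -> str:
        reply = replies[state["i"] % len(replies)]
        state["i"] += 1
        return reply

    return fake


# Five simulated runs of the same prompt, one of which disagrees.
fake_model = make_fake_model(["Yes", "yes ", "No", "Yes", "yes"])
score = consistency(fake_model, "Is product A under recall?", runs=5)
print(f"agreement across runs: {score:.2f}")
```

Tracking a number like this per test case in CI would be one way to separate flaky cases from consistently failing ones.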

2) In this scenario, we go to the documents powering a RAG application to both generate and grade answers. For example, the knowledge base might know that (1) product A is being recalled, and (2) customer #4 is asking for a warranty claim on product A. Using those two bits of information, we might generate a scenario that tests whether or not customer #4 gets the claim fulfilled. In other words, specific user information is simulated/used during the test set creation.
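As a toy sketch of that idea (not the actual implementation), the two facts can be combined into a test scenario with an expected outcome that a grader can check without any domain knowledge. Everything here is an assumption made for illustration: the `Claim` and `Scenario` types, the `generate_scenarios` helper, and especially the rule that a claim on a recalled product should be fulfilled while any other claim should not.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Claim:
    customer_id: int
    product: str


@dataclass
class Scenario:
    question: str
    expected_fulfilled: bool


def generate_scenarios(recalled: Set[str], claims: List[Claim]) -> List[Scenario]:
    """Combine knowledge-base facts into gradable test scenarios.

    Assumed toy rule: a warranty claim is expected to be fulfilled
    exactly when the product is in the recalled set.
    """
    scenarios = []
    for claim in claims:
        question = (
            f"Should customer #{claim.customer_id}'s warranty claim "
            f"on product {claim.product} be fulfilled?"
        )
        scenarios.append(Scenario(question, claim.product in recalled))
    return scenarios


# Fact 1: product A is recalled. Fact 2: customer #4 has a claim on product A.
# The claim from customer #7 on product B is an extra made-up contrast case.
scenarios = generate_scenarios({"A"}, [Claim(4, "A"), Claim(7, "B")])
for scenario in scenarios:
    print(scenario)
```

The expected outcome comes from the documents themselves, so the person reviewing the test set only has to confirm the rule, not each individual answer.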

I think this could be more useful to most people as a prompt/RAG testing service rather than an LLM testing service. If I ran a test and found out the LLM I was using is 60% accurate on some topic, what would I do with this knowledge? Build a more accurate LLM? Switch to another? On the other hand, if a service suggested ways to improve accuracy by scoring various prompt or RAG inputs, I think that would be very useful to many people. It could even uncover a general prompting strategy depending on the underlying LLM or the inputs available, which would be really useful.
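Roughly what I have in mind, as a sketch only: score each prompt variant against the same test set and compare. All names here (`score_prompts`, the templates, the fake model) are made up for illustration, and the fake model is rigged so that the "concise" prompt wins; a real service would call an actual LLM and use a better grader than exact match.

```python
from typing import Callable, Dict, List, Tuple


def score_prompts(
    run_model: Callable[[str], str],
    templates: Dict[str, str],
    test_set: List[Tuple[str, str]],
) -> Dict[str, float]:
    """Return exact-match accuracy on the test set for each prompt template."""
    scores = {}
    for name, template in templates.items():
        correct = sum(
            run_model(template.format(question=q)).strip().lower() == expected.lower()
            for q, expected in test_set
        )
        scores[name] = correct / len(test_set)
    return scores


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: only answers usefully when asked for one word."""
    if "one word" in prompt:
        return "yes" if "recalled" in prompt else "no"
    return "It depends."


templates = {
    "plain": "{question}",
    "concise": "Answer in one word. {question}",
}
test_set = [
    ("Is a recalled product eligible for a claim?", "yes"),
    ("Is an expired warranty eligible for a claim?", "no"),
]

scores = score_prompts(fake_model, templates, test_set)
best = max(scores, key=scores.get)
print(scores, "best:", best)
```

The same loop would work for RAG inputs by varying the retrieved context instead of the template.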