by Ephil012 970 days ago
Pretty cool tutorial. As a side note, these pipelines are pretty hard to evaluate for quality once you build them, since there aren't many standard practices yet given how new this all is. If it's helpful to anyone else, we built a free, open source tool at my company that is basically a collection of premade metrics for determining the quality of these pipelines. https://github.com/TonicAI/tvalmetrics
1 comments

This is really useful! Using LLM-assisted evaluation seems like the way to go for evaluating RAG applications. One issue I've faced while evaluating responses with GPT-4 is that the evaluation cost can get out of hand rather quickly. Do you have any measures in place, or ideas on how to handle this?
Unfortunately, right now the LLM cost is just a fundamental issue. I think it is hard to get around because comparing answer quality usually involves understanding the question and the answer itself, which is a task that's really well suited to LLMs.

One thing we have considered is that some forms of evaluation could be replaced by simply comparing the embeddings of the question, context, and answer instead of using an LLM for the analysis. The idea is that you could compare the embeddings to get a rough idea of performance based on similarity, which should in theory reduce costs. The only other alternative is to use less advanced models, which are cheaper.
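
To make the embedding idea concrete, here is a rough sketch of what I mean. It is not how tvalmetrics works, just an illustration. It assumes you already have embedding vectors for the question, the retrieved context, and the answer from whatever embedding model you use; the function names and the three score names are made up for this example.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as "not similar".
        return 0.0
    return dot / (norm_a * norm_b)


def rough_rag_scores(
    question_vec: Sequence[float],
    context_vec: Sequence[float],
    answer_vec: Sequence[float],
) -> Dict[str, float]:
    """Hypothetical cheap proxies for RAG quality, with no LLM call.

    Assumption: all three vectors come from the same embedding model.
    """
    return {
        # Does the answer stay on the topic of the question?
        "answer_vs_question": cosine_similarity(answer_vec, question_vec),
        # Is the answer close to the retrieved context?
        "answer_vs_context": cosine_similarity(answer_vec, context_vec),
        # Did retrieval pull context related to the question?
        "context_vs_question": cosine_similarity(context_vec, question_vec),
    }
```

You would call rough_rag_scores once per question/context/answer triple and look at the distribution of scores across your test set. Similarity only tells you the texts are about the same thing, not that the answer is correct, so this is a coarse filter rather than a replacement for an LLM judge.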