Hacker News
by pchunduri6 971 days ago
This is really useful! LLM-assisted evaluation seems like the way to go for RAG applications. One issue I've faced when evaluating responses with GPT-4 is that the evaluation cost can get out of hand rather quickly. Do you have any measures in place, or ideas on how to handle this?
1 comment

Unfortunately, LLM cost is a fundamental issue right now. I think it is hard to get around, because comparing answer quality usually requires understanding the question and the answer themselves, which is a task LLMs are really well suited to.

One thing we have considered is replacing some forms of evaluation with a comparison of the embeddings of the question, context, and answer, instead of having an LLM do the analysis. The idea is that you could compare the embeddings to get a rough idea of performance based on similarity, which should in theory reduce costs. The other alternative is to use less advanced models, which are cheaper.
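
To make the embedding idea concrete, here is a minimal sketch, not an actual implementation from the project. It assumes the question, context, and answer have already been embedded into equal-length vectors by whatever embedding model you use; the names `cosine_similarity` and `rough_rag_scores` are hypothetical, and cosine similarity is just one reasonable choice of similarity measure.

```python
import math
from typing import Dict, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rough_rag_scores(question: Vector, context: Vector, answer: Vector) -> Dict[str, float]:
    """Pairwise similarities between precomputed embeddings (hypothetical helper).

    The inputs are assumed to come from the same embedding model, so that
    distances between them are comparable.
    """
    return {
        "question_context": cosine_similarity(question, context),
        "context_answer": cosine_similarity(context, answer),
        "question_answer": cosine_similarity(question, answer),
    }
```

These scores are only a rough proxy: high similarity means the texts are about the same thing, not that the answer is correct.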