by pchunduri6
971 days ago
This is really useful! LLM-assisted evaluation seems like the way to go for RAG applications. One issue I've faced while evaluating responses with GPT-4 is that the evaluation cost can get out of hand rather quickly. Do you have any measures in place, or ideas on how to handle this?
|
One thing we have considered is replacing some forms of evaluation with a comparison of the embeddings of the question, context, and answer, instead of having an LLM do the analysis. The idea is that you could compare the embeddings to get a rough idea of performance based on similarity, which should in theory reduce costs. The only other alternative we see is to use less advanced models, which are cheaper.
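
To make the embedding idea a bit more concrete, here is a minimal sketch of what such a comparison could look like. It assumes you already have embedding vectors for the question, context, and answer from whatever embedding model you use. The function names and the three score labels are hypothetical, made up for illustration, and cosine similarity is just one reasonable choice of similarity measure.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as "no similarity".
        return 0.0
    return dot / (norm_a * norm_b)


def rough_rag_scores(
    question_vec: Sequence[float],
    context_vec: Sequence[float],
    answer_vec: Sequence[float],
) -> Dict[str, float]:
    """Pairwise similarities as a cheap, rough proxy for RAG quality.

    The score names are hypothetical labels, not an established metric set.
    """
    return {
        # Did retrieval pull in context related to the question?
        "question_context": cosine_similarity(question_vec, context_vec),
        # Does the answer stay on the topic of the question?
        "question_answer": cosine_similarity(question_vec, answer_vec),
        # Does the answer stay close to the retrieved context?
        "context_answer": cosine_similarity(context_vec, answer_vec),
    }
```

Since this only needs the embeddings and no LLM calls, it should be much cheaper per example, though similarity is only a rough signal: an answer can be close to the context in embedding space and still be wrong.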