| Let me half hijack to ask a related question: I'm building a RAG system for my personal use. Say I have a lot of notes on various topics I've compiled over the years. They're scattered over a lot of text files (and org nodes). I want to be able to ask questions in natural language and have the system query my notes and give me an answer. The approach I'm going for is to store those notes in a vector DB. When I ask my query, a search is performed and, say, the top 5 vectors are sent to GPT for parsing (along with my query). GPT will then come back with an answer. I can build something like this, but I'm struggling to figure out metrics for how good my system is. There are many variables (e.g. amount of content in a given vector, amount of overlap amongst vectors, number of vectors to send to GPT, and many more). I'd like to tweak them, but I also want some objective way to compare different setups. Right now all I do is ask a question, look at the answer, and try to subjectively gauge whether I think it did a good job. Any tips on how people measure the performance/effectiveness of these types of systems? |
What I can recommend is the coffee-tasting approach. Don't try to evaluate individual responses in isolation. Instead, lock the seed used in generation and run the same prompt twice, changing exactly one variable between the two runs, then do a relative comparison of the two outputs. Off the top of my head, the variables probably worth testing for you:
* Choice of models and/or tunes
* System prompts
* Temperature of the model against your queries
* Similarity threshold for document inclusion (you only want relevant documents from your RAG: set it too low and you'll get extra distractions, set it too high and useful information might be left out of the context).
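To make the last bullet concrete, here is a minimal sketch of threshold-based inclusion. It assumes your chunks are available as `(text, vector)` pairs and that cosine similarity is the metric; the names `select_chunks`, `threshold` and `top_k` are made up for illustration, and a real vector DB would normally do this scoring for you.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


def select_chunks(query_vec, chunks, threshold=0.75, top_k=5):
    """Return up to top_k (score, text) pairs scoring at least threshold.

    chunks is an iterable of (text, vector) pairs (an assumed layout).
    Results are sorted from most to least similar.
    """
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    kept = [pair for pair in scored if pair[0] >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:top_k]
```

`threshold` and `top_k` are then two separate knobs: lowering the threshold lets more marginal chunks into the context, while `top_k` caps how many get sent no matter how many pass. The default values here are placeholders, not recommendations.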
If you set up a system to track the comparisons, either automatically or by hand, that simply records which side of the change worked better for your use case, and you test that same change across a bunch of different prompts, you should be able to tally up whether the control or the change was preferred more often.
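A rough sketch of that loop, under some loud assumptions: `generate(prompt, config, seed)` and `judge(prompt, control_output, change_output)` are hypothetical callables you would supply yourself. `generate` wraps whatever model and retrieval stack you use, and `judge` can be you picking a side by hand. Neither is a real library API.

```python
from collections import Counter


def compare_one_change(generate, judge, prompts, base_config,
                       variable, new_value, seed=1234):
    """Run each prompt twice with the same seed, changing one variable.

    generate and judge are caller-supplied (hypothetical) callables.
    judge must return "control", "change", or "tie".
    """
    # Copy the config so the control settings are never mutated.
    changed_config = {**base_config, variable: new_value}
    tally = Counter()
    for prompt in prompts:
        control_out = generate(prompt, base_config, seed)
        change_out = generate(prompt, changed_config, seed)
        winner = judge(prompt, control_out, change_out)
        if winner not in ("control", "change", "tie"):
            raise ValueError(f"unexpected judgment: {winner!r}")
        tally[winner] += 1
    return {"control": tally["control"],
            "change": tally["change"],
            "tie": tally["tie"]}
```

If you judge by hand, it is probably worth having `judge` show the two outputs in random order so you don't know which side is the change while you pick.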
Keep those data points! They are your bench log, and they can be invaluable later on for anything you do with the system: you can see what changed in aggregate, which changes had the most outsized impact, etc. They can also guide you toward building useful testing tooling or finding existing solutions out there.
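One low-effort way to keep such a bench log is an append-only JSON Lines file, one record per comparison. This is only a sketch; the field names (`variable`, `control`, `change`, `prompt`, `winner`) and both function names are assumptions for illustration, not an established format.

```python
import json
from collections import defaultdict
from pathlib import Path


def log_comparison(log_path, variable, control, change, prompt, winner):
    """Append one comparison result to a JSON Lines bench log."""
    if winner not in ("control", "change", "tie"):
        raise ValueError(f"unexpected winner: {winner!r}")
    record = {
        "variable": variable,  # which knob was changed, e.g. "temperature"
        "control": control,    # value used in the control run
        "change": change,      # value used in the changed run
        "prompt": prompt,
        "winner": winner,
    }
    with Path(log_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def summarize(log_path):
    """Count control/change/tie outcomes per variable across the whole log."""
    totals = defaultdict(lambda: {"control": 0, "change": 0, "tie": 0})
    with Path(log_path).open(encoding="utf-8") as f:
        for line in f:
            if line.strip():
                record = json.loads(line)
                totals[record["variable"]][record["winner"]] += 1
    return dict(totals)
```

Because every record keeps the prompt and both values, you can go back later and slice the log differently, for example by prompt or by specific value pairs, without rerunning anything.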