by TrueDuality 989 days ago
For small personal projects it's kind of hard to build metrics like this because the volume of indexed content in the database tends to be pretty low. If you're indexing paragraphs, you might consistently be able to fit all relevant paragraphs in the context itself.

What I can recommend is to take the coffee tasting approach. Don't try to test and evaluate individual responses. Instead, lock the seed used in generation and use the same prompt for two different runs. Change one variable and do a relative comparison of the two outputs. The variables probably worth testing for you, off the top of my head:

* Choice of models and/or tunes

* System prompts

* Temperature of the model against your queries

* Threshold for similarity for document inclusion (you only want relevant documents from your RAG; set it too low and you'll get some extra distractions, too high and useful information might be left out of the context).
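
To make the "lock the seed, change one variable" idea concrete, here's a rough sketch of a paired run. Everything in it is an assumption for illustration: `fake_generate` is a hypothetical stand-in for whatever model call you actually use, and the config keys (`model`, `system_prompt`, `temperature`, `similarity_threshold`) just mirror the list above. The only real point is that both runs share a prompt and a seed, and the harness refuses to run if more than one variable differs.

```python
import random

def fake_generate(prompt, config, seed):
    """Hypothetical stand-in for a real model call; deterministic for a given seed."""
    rng = random.Random(seed)
    noise = rng.random() * config["temperature"]
    return f"{config['model']}|{config['system_prompt']}|{prompt}|{noise:.3f}"

def changed_keys(control, change):
    """Names of the settings that differ between the two configs."""
    keys = control.keys() | change.keys()
    return sorted(k for k in keys if control.get(k) != change.get(k))

def run_pair(generate, prompt, control, change, seed):
    """Run the same prompt and seed against two configs that differ in one variable."""
    diff = changed_keys(control, change)
    if len(diff) != 1:
        raise ValueError(f"expected exactly one changed variable, got {diff}")
    return {
        "variable": diff[0],
        "prompt": prompt,
        "control_output": generate(prompt, control, seed),
        "change_output": generate(prompt, change, seed),
    }

# Illustrative values only; none of these come from a real setup.
control = {
    "model": "model-a",
    "system_prompt": "Be brief.",
    "temperature": 0.2,
    "similarity_threshold": 0.75,
}
change = {**control, "temperature": 0.8}

pair = run_pair(fake_generate, "What do my notes say about X?", control, change, seed=1234)
print(pair["variable"])
print(pair["control_output"])
print(pair["change_output"])
```

You'd then look at the two outputs side by side and note which one you prefer, without worrying about scoring either one in isolation.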

If you set up a system to track the comparisons, either automatically or by hand, that just indicates which side of the change worked better for your use case, and test that same change against a bunch of different prompts, you should be able to tally up whether the control or the change was preferred more often.
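
The tally itself can be tiny. A minimal sketch, assuming each comparison is recorded as one of the strings `"control"`, `"change"`, or `"tie"` (those labels are my own convention, not anything standard):

```python
from collections import Counter

def tally(judgments):
    """Count preferences and compute how often the change won among decided comparisons."""
    counts = Counter(judgments)
    decided = counts["control"] + counts["change"]
    # Ties are counted but left out of the win rate.
    win_rate = counts["change"] / decided if decided else None
    return counts, win_rate

# One judgment per prompt, all for the same single-variable change.
judgments = ["change", "control", "change", "change", "tie"]
counts, win_rate = tally(judgments)
print(dict(counts), win_rate)
```

With only a handful of prompts the win rate is a rough signal at best, so treat it as a nudge rather than a measurement.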

Keep those data points! They are your bench log and can be invaluable later on for anything you do with the system: you can see what changed in aggregate, what had the most outsized impact, etc., and they can guide you to build useful tooling for testing or to find existing solutions out there.
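
One low-effort way to keep a bench log is an append-only JSON Lines file, one record per comparison. This is just a sketch under my own assumptions: the record fields (`variable`, `prompt`, `winner`) and the file name are made up for illustration, and the example writes to a temporary directory so it runs anywhere.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path

def log_comparison(path, variable, prompt, winner):
    """Append one comparison result to the log as a single JSON line."""
    record = {"variable": variable, "prompt": prompt, "winner": winner}
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

def summarize(path):
    """Aggregate wins per tested variable across the whole log."""
    wins = defaultdict(lambda: {"control": 0, "change": 0})
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            if rec["winner"] in ("control", "change"):
                wins[rec["variable"]][rec["winner"]] += 1
    return {name: dict(count) for name, count in wins.items()}

with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "bench_log.jsonl"
    log_comparison(log_path, "temperature", "prompt 1", "change")
    log_comparison(log_path, "temperature", "prompt 2", "control")
    log_comparison(log_path, "temperature", "prompt 3", "change")
    log_comparison(log_path, "system_prompt", "prompt 1", "control")
    summary = summarize(log_path)

print(summary)
```

In real use you'd point `log_path` at a file that sticks around, so the history accumulates across sessions.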