I've been looking into this question for a bit. [1] Here are my notes on evals.

Things to consider when comparing options:

1) “Types of metrics supported (only NLP metrics, model-graded evals, or both), level of customizability; supports component eval (i.e. single prompts) or pipeline evals (i.e. testing the entire pipeline, all the way from retrieval to post-processing)”

2) “+method of dataset & eval management (config vs UI), tracing to help debug failing evals”

3) “If you wanted to go deeper on evaluation, I'd probably also add:

What to evaluate for:
- Hallucination
- Safety
- Usefulness
- Tone / format (eg conciseness)
- Specific regressions

Tips:
- Model-graded evaluation is taking off
- Use GPT-4, GPT-3.5 is not good enough [for evals]
- Most big companies have some human oversight of the model-grading
- Conversational simulation is an emerging idea building on top of model-graded eval”

- AI Startup Founder

---

Here are a few that people are using for evals at production scale:

* Honeyhive https://honeyhive.ai
* Gentrace https://gentrace.ai
* Humanloop https://humanloop.com
* Gantry https://www.gantry.io

I've done calls with the founders of three of those four, and I've talked with enterprise customers who've been evaluating a couple of those. I see there are a few others mentioned in this thread (langfuse, truera, langkit/whylabs) that I haven't heard about from customers but that also look promising. There's also langsmith, which I do know is popular amongst enterprises (enterprises hear of langchain, see that they have a big enterprise-oriented offering), but I haven't talked with anyone who uses it.

Then for evals at prototyping scale there are various small tools and open source tools that I've collected here: https://llm-utils.org/List+of+tools+for+prompt+engineering

[1]: I'm working on an AI infra handbook. Email me, email in profile, if you can review/add comments to my draft. It's 23 pages long :x
------
We've built with all of these considerations in mind:
1) Support for standard metrics like BLEU, ROUGE, BERT similarity, and model-graded evals; we serialize the entire LLM app call so you can test anything that happens in it (context chunks, tool calls/inputs, etc.)
2) Because we serialize the whole record, you can use this tracing to debug failing evals. We also have chain-of-thought reasoning evals that can explain why an eval failed. Lastly, there's a Streamlit UI you can launch (tru.run_dashboard) that runs locally
3) Hallucination is probably the biggest problem we solve for. To do evals for hallucination, we typically see our users use a combination of groundedness (does the context support the LLM response) and context relevance (is the retrieved context relevant to the query). There's also a bunch more for the evaluations you mentioned (moderation models, sentiment, usefulness, etc.) and it's pretty easy to add custom evals.
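To make the groundedness / context relevance split above concrete, here is a minimal sketch in plain Python. It is not the TruLens API: `RagRecord`, `overlap_judge`, `groundedness`, and `context_relevance` are hypothetical names made up for illustration, and the token-overlap judge is only a toy stand-in for the model-graded call you would use in practice.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


# Hypothetical record type for one RAG call; not a TruLens class.
@dataclass
class RagRecord:
    query: str
    contexts: List[str]
    response: str


# A judge maps (premise, hypothesis) to a support score in [0, 1].
# In a real setup this would wrap an LLM or NLI model call.
Judge = Callable[[str, str], float]


def _tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_judge(premise: str, hypothesis: str) -> float:
    """Toy judge: fraction of hypothesis tokens that also appear in the premise."""
    hyp = _tokens(hypothesis)
    if not hyp:
        return 0.0
    return len(hyp & _tokens(premise)) / len(hyp)


def groundedness(record: RagRecord, judge: Judge = overlap_judge) -> float:
    """Does the retrieved context support the response?

    Scores each response sentence against the joined context and averages.
    """
    context = " ".join(record.contexts)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", record.response.strip()) if s]
    if not sentences:
        return 0.0
    return sum(judge(context, s) for s in sentences) / len(sentences)


def context_relevance(record: RagRecord, judge: Judge = overlap_judge) -> float:
    """Is the retrieved context relevant to the query? Averages over chunks."""
    if not record.contexts:
        return 0.0
    return sum(judge(c, record.query) for c in record.contexts) / len(record.contexts)
```

A low groundedness score with high context relevance suggests the model ignored good context; low context relevance points at retrieval instead. Swapping `overlap_judge` for a model-graded judge keeps the same structure.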
Also - my hot take is that gpt-3.5 is good enough for evals (and sometimes better than gpt-4) if you give the LLM enough instructions on how to do the eval.
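As a rough illustration of what "enough instructions" could look like, here is a sketch of a grader prompt with an explicit step-by-step rubric and a strict score parser. The template wording, the 0-10 scale, and the names `GRADER_TEMPLATE`, `build_grader_prompt`, and `parse_score` are all assumptions for this example, not anything from TruLens; the actual model call is left out.

```python
import re
from typing import List, Optional

# Hypothetical grader template: spells out the criterion, the steps,
# and the exact output format so a smaller model has less to guess.
GRADER_TEMPLATE = """You are grading an answer on one criterion.

Criterion: {criterion}

Follow these steps in order:
{rubric}

Question: {question}
Answer: {answer}

Write your reasoning first, then end with a final line of the form:
Score: <integer from 0 to 10>
"""


def build_grader_prompt(
    criterion: str, rubric_steps: List[str], question: str, answer: str
) -> str:
    """Fill the template, numbering the rubric steps."""
    rubric = "\n".join(f"{i}. {step}" for i, step in enumerate(rubric_steps, 1))
    return GRADER_TEMPLATE.format(
        criterion=criterion, rubric=rubric, question=question, answer=answer
    )


def parse_score(reply: str) -> Optional[float]:
    """Return the last 'Score: N' in the reply, normalized to [0, 1].

    Returns None when no score line is present or N is out of range,
    so malformed grader output is never silently counted as a score.
    """
    matches = re.findall(r"Score:\s*(\d+)", reply)
    if not matches:
        return None
    value = int(matches[-1])
    if value > 10:
        return None
    return value / 10
```

Returning `None` for unparseable replies makes it easy to count how often the grader fails to follow the format, which is itself a useful signal when comparing grader models.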
website: https://www.trulens.org/
github: https://github.com/truera/trulens