by joshreini 997 days ago
Bias up front: I'm a TruLens developer

------

We've built for all of these considerations:

1) Support for standard metrics like BLEU, ROUGE, BERT similarity, and model-graded evals; we serialize the entire LLM app call so you can test anything that happens in it (context chunks, tool calls/inputs, etc.)

2) Because we serialize the whole record, you can use this tracing to debug failing evals. We also have chain-of-thought reasoning evals that can explain why an eval failed. Last - there's a Streamlit UI you can launch (tru.run_dashboard) that runs locally.

3) Hallucination is probably the biggest problem we solve for. To do evals for hallucination, we typically see our users use a combination of groundedness (does the context support the LLM response) and context relevance (is the retrieved context relevant to the query). There's also a bunch more for the evaluations you mentioned (moderation models, sentiment, usefulness, etc.) and it's pretty easy to add custom evals.
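To make the two hallucination checks concrete, here's a toy sketch in plain Python. It is not the TruLens implementation or API: the function names (`overlap_score`, `groundedness`, `context_relevance`) are made up for illustration, and crude token overlap stands in for the model-graded scoring a real eval would use.

```python
import re


def tokens(text):
    """Lowercase word tokens; a crude stand-in for a real tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(claim, evidence):
    """Fraction of the claim's tokens that also appear in the evidence."""
    claim_tokens = tokens(claim)
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & tokens(evidence)) / len(claim_tokens)


def groundedness(response, context_chunks):
    """Hypothetical check: does the retrieved context support the response?"""
    return overlap_score(response, " ".join(context_chunks))


def context_relevance(query, context_chunks):
    """Hypothetical check: mean per-chunk relevance of the context to the query."""
    if not context_chunks:
        return 0.0
    scores = [overlap_score(query, chunk) for chunk in context_chunks]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    chunks = ["The capital of France is Paris.", "Bananas are yellow."]
    print(groundedness("Paris is the capital of France", chunks))
    print(context_relevance("capital of France", chunks))
```

A low groundedness score with high context relevance suggests the model ignored good context; low context relevance points at retrieval instead. Token overlap misses paraphrases entirely, which is why a model-graded version is the usual choice in practice.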

Also - my hot take is that gpt-3.5 is good enough for evals (sometimes better than gpt-4) if you give the LLM enough instructions on how to do the eval.
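As a rough illustration of what "enough instructions" can look like, here's a sketch that builds an explicit rubric prompt and parses a numeric score from the reply. The prompt wording, the 0-10 scale, and the helper names (`build_eval_prompt`, `parse_score`) are all assumptions for this example, not TruLens prompts or functions, and the actual model call is left out.

```python
import re


def build_eval_prompt(criterion, rubric, query, response):
    """Assemble a grading prompt that spells out the criterion and the scale."""
    rubric_lines = "\n".join(f"- {score}: {meaning}" for score, meaning in rubric)
    return (
        f"You are grading an answer for {criterion}.\n"
        f"Use this scale:\n{rubric_lines}\n\n"
        f"Question: {query}\n"
        f"Answer: {response}\n\n"
        "First explain your reasoning in one or two sentences.\n"
        "Then finish with a line of the form 'Score: <integer>'."
    )


def parse_score(model_output, low=0, high=10):
    """Extract 'Score: N' and normalize to [0, 1]; None if missing or out of range."""
    match = re.search(r"Score:\s*(\d+)", model_output)
    if match is None:
        return None
    score = int(match.group(1))
    if not low <= score <= high:
        return None
    return (score - low) / (high - low)


if __name__ == "__main__":
    rubric = [(0, "irrelevant"), (5, "partially relevant"), (10, "fully relevant")]
    print(build_eval_prompt("relevance", rubric, "What is BLEU?", "A metric."))
    print(parse_score("The answer is on topic but thin.\nScore: 7"))
```

Asking for the reasoning before the score also leaves an explanation you can read when an eval fails.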

website: https://www.trulens.org/ github: https://github.com/truera/trulens