| HN Mirror


by tikkun 1044 days ago

I've been looking into this question for a bit. [1]

Here are my notes on evals:

Things to consider when comparing options:

1) “Types of metrics supported (only NLP metrics, model-graded evals, or both), level of customizability; supports component eval (i.e. single prompts) or pipeline evals (i.e. testing the entire pipeline, all the way from retrieval to post-processing)”

2) “+method of dataset & eval management (config vs UI), tracing to help debug failing evals”

3) “If you wanted to go deeper on evaluation, I'd probably also add:

What to evaluate for:

- Hallucination

- Safety

- Usefulness

- Tone / format (eg conciseness)

- Specific regressions

Tips:

- Model-graded evaluation is taking off

- Use GPT-4, GPT-3.5 is not good enough [for evals]

- Most big companies have some human oversight of the model-grading

- Conversational simulation is an emerging idea building on top of model-graded eval” - AI Startup Founder
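A minimal sketch of the model-graded eval idea from the tips above. It assumes a hypothetical `grader` callable that wraps whichever LLM you use (prompt in, raw text out); no vendor API is implied. The rubric wording, `grade_response`, and the PASS/FAIL convention are illustrative choices, not something from the notes. The `needs_human_review` flag is one way to hook in the human oversight mentioned above.

```python
from typing import Callable

# Illustrative rubric; real rubrics are usually much more detailed.
RUBRIC = (
    "You are grading an assistant's answer.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Reply with PASS or FAIL on the first line, then one sentence of reasoning."
)


def grade_response(question: str, answer: str, grader: Callable[[str], str]) -> dict:
    """Ask a grader model for a verdict and parse the first line of its reply.

    `grader` is a hypothetical callable wrapping an LLM call.
    """
    raw = grader(RUBRIC.format(question=question, answer=answer))
    lines = raw.strip().splitlines()
    first_line = lines[0].strip().upper() if lines else ""
    return {
        "passed": first_line.startswith("PASS"),
        "raw": raw,
        # Anything that is not a clean verdict gets routed to a person.
        "needs_human_review": first_line not in ("PASS", "FAIL"),
    }


def fake_grader(prompt: str) -> str:
    # Stand-in for a real LLM call so the sketch runs offline.
    if "Paris" in prompt:
        return "PASS\nThe answer names the right city."
    return "FAIL\nThe answer names the wrong city."


if __name__ == "__main__":
    print(grade_response("Capital of France?", "Paris", fake_grader)["passed"])
```

Swapping `fake_grader` for a real model call is the only change needed to try this against live generations.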

---

Here are a few that people are using for evals at production scale:

* Honeyhive https://honeyhive.ai

* Gentrace https://gentrace.ai

* Humanloop https://humanloop.com

* Gantry https://www.gantry.io

I've done calls with the founders of three of those four, and I've talked with enterprise customers who've been evaluating a couple of those.

I see there are a few others mentioned in this thread (langfuse, truera, langkit/whylabs) that I haven't heard about from customers but that also look promising. There's also langsmith, which I do know is popular amongst enterprises (enterprises hear of langchain, see that they have a big enterprise-oriented offering), but I haven't talked with anyone who uses it.

Then for evals at prototyping scale there are various small tools and open source tools that I've collected here: https://llm-utils.org/List+of+tools+for+prompt+engineering

[1]: I'm working on an AI infra handbook. Email me, email in profile, if you can review/add comments to my draft. It's 23 pages long :x

4 comments

joshreini 1043 days ago

Bias up front: I'm a TruLens developer

------

We've built for all of these considerations:

1) Support for standard metrics like BLEU, ROUGE, BERT similarity, and model-graded evals; we serialize the entire LLM app call so you can test anything that happens in it (context chunks, tool calls/inputs, etc.)

2) Because we serialize the whole record, you can use this tracing to debug failing evals. We also have chain-of-thought reasoning evals that can explain why an eval failed. Lastly, there's a Streamlit UI you can launch (tru.run_dashboard) that runs locally.

3) Hallucination is probably the biggest problem we solve for. To do evals for hallucination, we typically see our users use a combination of groundedness (does the context support the LLM response) and context relevance (is the retrieved context relevant to the query). There's also a bunch more for the evaluations you mentioned (moderation models, sentiment, usefulness, etc.) and it's pretty easy to add custom evals.
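To make the two hallucination checks concrete, here is a toy sketch. It is not the TruLens implementation or API: it swaps in plain token overlap as a crude proxy, where a real setup would more likely use a model-graded score. The function names and the overlap heuristic are assumptions for illustration only.

```python
import re


def tokens(text: str) -> set:
    """Lowercased alphanumeric tokens, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(a: str, b: str) -> float:
    """Fraction of a's distinct tokens that also appear in b."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta) if ta else 0.0


def groundedness(response: str, context: str) -> float:
    # Proxy for "does the context support the LLM response?"
    return overlap(response, context)


def context_relevance(query: str, context: str) -> float:
    # Proxy for "is the retrieved context relevant to the query?"
    return overlap(query, context)


if __name__ == "__main__":
    context = "The Eiffel Tower is in Paris"
    print(groundedness("The tower is in Berlin", context))
    print(context_relevance("Where is the Eiffel Tower", context))
```

A low groundedness score with high context relevance suggests the retrieval was fine and the generation drifted; the reverse points at retrieval.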

Also, my hot take is that gpt-3.5 is good enough for evals (sometimes better than gpt-4) if you give the LLM enough instructions on how to do the eval.

website: https://www.trulens.org/ github: https://github.com/truera/trulens


tikkun 1044 days ago

Then there's also monitoring. My notes from the monitoring section are below:

When needed: “it goes hand-in-hand with eval, as you need to be able to turn bad prod generations into failing eval cases for eng to make pass”

Considerations:

1) “Ability to monitor custom metrics (ROUGE [1], Coherence, etc.) and slice-and-dice data - customizability; non-intrusive logging vs proxies, VPC (enterprise-readiness). Plenty of tools w/ basic cost, latency monitoring; very few w/ enterprise-grade customizability, anomaly detection, etc.”

2) “+agent/pipeline tracing, ability to re-purpose data for fine tuning, connection to user feedback, man-in-the-middle approach (proxy) vs SDK integration (we believe SDK is superior so your monitoring vendor can go down without taking down your LLM feature)”
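A rough sketch of the SDK-style argument in point 2, assuming a hypothetical `MonitoringClient` and a hypothetical `send` transport (neither is any vendor's real API). Events are queued and shipped on a background thread, so a vendor outage costs you logs but not the LLM feature. With a proxy, by contrast, the vendor sits on the request path.

```python
import queue
import threading
from typing import Callable


class MonitoringClient:
    """Hypothetical SDK-style logger that ships events off the request path."""

    def __init__(self, send: Callable[[dict], None]):
        self._send = send  # hypothetical transport to the monitoring vendor
        self._queue = queue.Queue()
        self.dropped = 0
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def log(self, event: dict) -> None:
        # Unbounded queue: enqueueing never waits on the vendor.
        self._queue.put(event)

    def _drain(self) -> None:
        while True:
            event = self._queue.get()
            try:
                self._send(event)
            except Exception:
                # Vendor is down: count the loss and keep serving traffic.
                self.dropped += 1
            finally:
                self._queue.task_done()

    def flush(self) -> None:
        """Block until every queued event has been attempted."""
        self._queue.join()


def generate(prompt: str, llm: Callable[[str], str], monitor: MonitoringClient) -> str:
    output = llm(prompt)  # the LLM call does not depend on the monitor
    monitor.log({"prompt": prompt, "output": output})
    return output


if __name__ == "__main__":
    def broken_send(event: dict) -> None:
        raise ConnectionError("vendor down")

    monitor = MonitoringClient(broken_send)
    print(generate("hello", lambda p: p.upper(), monitor))
    monitor.flush()
    print("dropped events:", monitor.dropped)
```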

Companies for LLM monitoring: Helicone, Honeyhive, Gentrace, Humanloop, Langsmith, Pezzo

[1]: https://en.wikipedia.org/wiki/ROUGE_(metric)
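For anyone unfamiliar with ROUGE as a custom metric, a minimal sketch of ROUGE-1 (unigram overlap with clipped counts). It assumes plain whitespace tokenization and lowercasing; real implementations usually add stemming and smarter tokenization, so treat the numbers as illustrative.

```python
from collections import Counter


def rouge_1(candidate: str, reference: str) -> dict:
    """ROUGE-1 precision, recall and F1 from clipped unigram counts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    print(rouge_1("the cat sat on the mat", "the cat is on the mat"))
```

In the example, five of the six reference unigrams are matched, so recall is 5/6; the candidate also has six tokens, so precision is 5/6 as well.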


batshit_beaver 1044 days ago

I'd suggest looking into WhyLabs. They've got anomaly detection, lightweight SDK, complete data privacy, and ability to ingest custom metrics: https://docs.whylabs.ai/docs/start-here


tikkun 1044 days ago

Do you have an association with them, or are you just a happy user? Either is fine of course


batshit_beaver 1044 days ago

Former employee, yes. Stumbled across this thread, thought I'd chime in. Didn't realize how many other folks are working on tools for this problem!


sourabh03agr 1042 days ago

You can add UpTrain to the list. We are building an open-source LLM evaluation tool with pre-built evaluations such as factual accuracy, retrieval quality, response completeness, tonality, etc., as well as an easily extensible framework that lets LLM developers define custom evaluations by chaining individual operators. Check out our demo here: https://demo.uptrain.ai/evals_demo/


fogx 1044 days ago

Do you have any links to conversational simulation?


tikkun 1044 days ago

Here's the note I have on that: “For chatbot interfaces, emerging approach is to have another agent simulating the user (as opposed to a more classic approach based on token prediction probs on chat transcripts, what I think you're referencing). Then still use a model for grading. Only place I've seen this so far: https://github.com/Forethought-Technologies/AutoChain/blob/m... ” - AI Startup Founder
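A bare-bones sketch of that loop, under stated assumptions: `simulated_user`, `assistant`, and `grade` are hypothetical callables that would each wrap an LLM in practice, and are scripted stubs here so it runs offline. This is not how AutoChain does it, just the general shape of "one agent plays the user, then a model grades the transcript".

```python
from typing import Callable, List, Optional, Tuple

Transcript = List[Tuple[str, str]]  # (speaker, message) pairs


def simulate_conversation(
    simulated_user: Callable[[Transcript], Optional[str]],
    assistant: Callable[[Transcript], str],
    max_turns: int = 3,
) -> Transcript:
    """Alternate user and assistant turns until the user stops or turns run out."""
    transcript: Transcript = []
    for _ in range(max_turns):
        user_msg = simulated_user(transcript)
        if user_msg is None:  # the simulated user has nothing more to say
            break
        transcript.append(("user", user_msg))
        transcript.append(("assistant", assistant(transcript)))
    return transcript


def scripted_user(transcript: Transcript) -> Optional[str]:
    # Stand-in for an LLM playing a customer who wants a refund.
    script = ["I need a refund.", "My order number is 123."]
    turn = len(transcript) // 2
    return script[turn] if turn < len(script) else None


def scripted_assistant(transcript: Transcript) -> str:
    # Stand-in for the chatbot under test.
    user_turns = sum(1 for speaker, _ in transcript if speaker == "user")
    if user_turns == 1:
        return "What is your order number?"
    return "Refund issued for order 123."


def grade(transcript: Transcript) -> bool:
    # Stand-in for a model-graded check of the whole conversation.
    return any(
        speaker == "assistant" and "Refund issued" in message
        for speaker, message in transcript
    )


if __name__ == "__main__":
    convo = simulate_conversation(scripted_user, scripted_assistant, max_turns=5)
    for speaker, message in convo:
        print(f"{speaker}: {message}")
    print("passed:", grade(convo))
```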
