How are generative AI companies monitoring their systems in production?
18 points by wujerry2000 996 days ago
Companies with LLM-based products (specifically Retrieval-Augmented Generation) deployed in production - how are you monitoring outputs for hallucinations? What's your process?
8 comments

We struggled with this ourselves while building LLM-based products and then open-sourced our observability/monitoring tool [1]. Many use it to track RAG and agents in production, run custom evals on the production traces (focused on hallucination), and track how metrics differ across releases or customers. Feel free to DM if there is something specific you are looking to solve, happy to help.

[1] https://github.com/langfuse/langfuse

There are quite a few LLM monitoring tools in the market. But for monitoring (or evaluating) RAG systems, I found Ragas to be the most helpful: https://blog.langchain.dev/evaluating-rag-pipelines-with-rag...
Langkit is an option. WhyLabs has published a number of blog posts on this subject recently: https://whylabs.ai/blog/posts/monitoring-llm-performance-wit...
we built https://klu.ai/ for this

======

outside of us, here's what I see happening

80% of folks aren't building in prod

if you pull apart the 20% that are building, I've seen this from largest to smallest population:

1. most people are not monitoring, followed by 2. home-grown solutions logged into existing observe/analytics platforms, followed by 3. LLMOps tooling like Klu

the 2 cents on the unfortunate truth: I think that many of the AI bolt-on features are living the classic feature lifecycle in that they are launched, no one is monitoring them for improvement, and the feature retention sucks so there's no top-down push to prioritize. the people measuring and improving are exceptional builders regardless of LLMs/RAG.

Also very interested in this question. We are looking at Truera for observability.
Hey @sirspacey - I'm a TruLens/TruEra dev. If we're not in contact already, feel free to shoot me an email or join our slack community if you need assistance or want to talk through how to best leverage TruEra for this.

email: josh.reini@truera.com slack: https://communityinviter.com/apps/aiqualityforum/josh

It's crucial for AI companies to monitor their systems in production continuously. Not only does this ensure the system's performance and reliability, but it also helps in identifying and addressing any issues or biases that may arise.

Many AI companies use a combination of real-time monitoring, automated alerts, and regular audits to maintain the quality and fairness of their AI systems. It's an ongoing process that plays a vital role in responsible AI development.

In case you have an AI project in mind, feel free to contact us! https://www.ratherlabs.com

We are also looking for a solution to this. Currently evaluating LangSmith for it.
I've been looking into this question for a bit. [1]

Here are my notes on evals --

Things to consider when comparing options:

1) “Types of metrics supported (only NLP metrics, model-graded evals, or both), level of customizability; supports component eval (i.e. single prompts) or pipeline evals (i.e. testing the entire pipeline, all the way from retrieval to post-processing)”

2) “+method of dataset & eval management (config vs UI), tracing to help debug failing evals”

3) “If you wanted to go deeper on evaluation, I'd probably also add:

What to evaluate for:

- Hallucination

- Safety

- Usefulness

- Tone / format (eg conciseness)

- Specific regressions

Tips:

- Model-graded evaluation is taking off

- Use GPT-4, GPT-3.5 is not good enough [for evals]

- Most big companies have some human oversight of the model-grading

- Conversational simulation is an emerging idea building on top of model-graded eval” - AI Startup Founder

---

Here are a few that people are using for evals at production scale:

* Honeyhive https://honeyhive.ai

* Gentrace https://gentrace.ai

* Humanloop https://humanloop.com

* Gantry https://www.gantry.io

I've done calls with the founders of three of those four, and I've talked with enterprise customers who've been evaluating a couple of those.

I see there's a few others mentioned in this thread (langfuse, truera, langkit/whylabs) that I haven't heard about from customers but also look promising. There's also langsmith which I do know is popular amongst enterprises (enterprises hear of langchain, see that they have a big enterprise-oriented offering) but I haven't talked with anyone who uses it.

Then for evals at prototyping scale there are various small tools and open source tools that I've collected here: https://llm-utils.org/List+of+tools+for+prompt+engineering

[1]: I'm working on an AI infra handbook. Email me, email in profile, if you can review/add comments to my draft. It's 23 pages long :x

Bias up front: I'm a TruLens developer

------

We've built all of these considerations:

1) Support for standard metrics like BLEU, ROUGE, BERT similarity, and model-graded evals; we serialize the entire LLM app call so you can test anything that happens in it (context chunks, tool calls/inputs, etc.)

2) Because we serialize the whole record, you can use this tracing to debug failing evals. We also have chain-of-thought reasoning evals that can explain why an eval failed. Lastly, there's a Streamlit UI you can launch (tru.run_dashboard) that runs locally

3) Hallucination is probably the biggest problem we solve for. To do evals for hallucination, we typically see our users use a combination of groundedness (does the context support the LLM response) and context relevance (is the retrieved context relevant to the query). There's also a bunch more for the evaluations you mentioned (moderation models, sentiment, usefulness, etc.) and it's pretty easy to add custom evals.
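
To make the two checks in 3) concrete, here's a minimal sketch. This is not the TruLens API: `groundedness` and `context_relevance` are hypothetical helpers, and plain token overlap is used as a crude stand-in for the model-graded scoring a real tool would do.

```python
from __future__ import annotations

import re


def _tokens(text: str) -> set[str]:
    """Lowercase word tokens; a crude stand-in for real semantic matching."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(claim: str, evidence: str) -> float:
    """Fraction of the claim's tokens that also appear in the evidence."""
    claim_tokens = _tokens(claim)
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & _tokens(evidence)) / len(claim_tokens)


def groundedness(response: str, context_chunks: list[str]) -> float:
    """Does the retrieved context support the LLM response?"""
    return overlap(response, " ".join(context_chunks))


def context_relevance(query: str, context_chunks: list[str]) -> float:
    """Is the retrieved context relevant to the query? Averaged per chunk."""
    if not context_chunks:
        return 0.0
    return sum(overlap(query, c) for c in context_chunks) / len(context_chunks)


chunks = ["Paris is the capital of France."]
print(groundedness("Paris is the capital of France", chunks))  # fully supported
print(groundedness("Berlin is the capital", chunks))           # "berlin" is unsupported
print(context_relevance("capital of France", chunks))
```

A low groundedness score with high context relevance suggests the model ignored good context; low context relevance points at the retriever instead.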

Also - my hot take is that gpt-3.5 is good enough for evals (sometimes better than gpt-4) if you give the LLM enough instructions on how to do the eval.

website: https://www.trulens.org/ github: https://github.com/truera/trulens

Then there's also monitoring. My notes from the monitoring section are below:

When needed: “it goes hand-in-hand with eval, as you need to be able to turn bad prod generations into failing eval cases for eng to make pass”
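
A rough sketch of that loop, assuming a made-up trace schema: `ProdTrace`, `EvalCase`, `to_eval_cases` and `run_eval` are all hypothetical names, and `generate` / `grade` stand in for your pipeline and whatever grader you use.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ProdTrace:
    """One logged production generation (hypothetical schema)."""
    query: str
    context: str
    response: str
    user_flagged: bool  # e.g. a thumbs-down from the end user


@dataclass
class EvalCase:
    """A regression case derived from a bad production generation."""
    query: str
    context: str
    bad_response: str


def to_eval_cases(traces: list[ProdTrace]) -> list[EvalCase]:
    """Keep only flagged traces and convert them into eval cases."""
    return [EvalCase(t.query, t.context, t.response) for t in traces if t.user_flagged]


def run_eval(cases, generate, grade) -> list[EvalCase]:
    """Re-run each case through `generate`; return the cases that still fail `grade`."""
    failing = []
    for case in cases:
        new_response = generate(case.query, case.context)
        if not grade(case, new_response):
            failing.append(case)
    return failing


traces = [
    ProdTrace("q1", "ctx", "bad answer", user_flagged=True),
    ProdTrace("q2", "ctx", "fine answer", user_flagged=False),
]
cases = to_eval_cases(traces)
# Stub pipeline that has not been fixed yet, so the case still fails.
still_failing = run_eval(
    cases,
    generate=lambda q, c: "bad answer",
    grade=lambda case, resp: resp != case.bad_response,
)
print(len(cases), len(still_failing))
```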

Considerations:

1) “Ability to monitor custom metrics (ROUGE [1], Coherence, etc.) and slice-and-dice data - customizability; non-intrusive logging vs proxies, VPC (enterprise-readiness). Plenty of tools w/ basic cost, latency monitoring; very few w/ enterprise-grade customizability, anomaly detection, etc.”

2) “+agent/pipeline tracing, ability to re-purpose data for fine tuning, connection to user feedback, man-in-the-middle approach (proxy) vs SDK integration (we believe SDK is superior so your monitoring vendor can go down without taking down your LLM feature)”
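
On the SDK-vs-proxy point in 2), here's a minimal sketch of what "the vendor can go down without taking down your LLM feature" can look like on the SDK side. `FailOpenLogger` and `send` are hypothetical, not any vendor's actual client; the idea is just that logging happens off the request path and errors are swallowed.

```python
import queue
import threading


class FailOpenLogger:
    """Ships events on a background thread and never raises into the caller."""

    def __init__(self, send, max_pending: int = 1000):
        self._send = send  # hypothetical callable that delivers one event to the vendor
        self._queue = queue.Queue(maxsize=max_pending)
        self.dropped = 0
        threading.Thread(target=self._worker, daemon=True).start()

    def log(self, event: dict) -> None:
        """Enqueue an event; drop it rather than block if the queue is full."""
        try:
            self._queue.put_nowait(event)
        except queue.Full:
            self.dropped += 1

    def _worker(self) -> None:
        while True:
            event = self._queue.get()
            try:
                self._send(event)
            except Exception:
                self.dropped += 1  # vendor outage: swallow the error and carry on
            finally:
                self._queue.task_done()

    def flush(self) -> None:
        """Wait until every queued event has been attempted."""
        self._queue.join()


def broken_vendor(event: dict) -> None:
    raise ConnectionError("monitoring vendor is down")


logger = FailOpenLogger(broken_vendor)
response = "LLM output"  # the LLM call itself is untouched by the logger
logger.log({"prompt": "hi", "response": response})
logger.flush()
print(response, logger.dropped)
```

With a proxy, the same outage would sit between you and the model, so the request itself would fail.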

Companies for LLM monitoring: Helicone, Honeyhive, Gentrace, Humanloop, Langsmith, Pezzo

[1]: https://en.wikipedia.org/wiki/ROUGE_(metric)
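
For anyone unfamiliar with ROUGE as a custom metric, here's a small sketch of ROUGE-1 (unigram overlap). It assumes whitespace tokenization and no stemming, so treat it as an illustration rather than a replacement for a proper ROUGE package; `rouge_1` is a hypothetical helper name.

```python
from collections import Counter


def rouge_1(candidate: str, reference: str) -> dict:
    """ROUGE-1 precision, recall and F1 from clipped unigram counts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipping).
    matches = sum((cand & ref).values())
    precision = matches / sum(cand.values()) if cand else 0.0
    recall = matches / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge_1("the cat sat on the mat", "the cat is on the mat")
print(scores)  # 5 of 6 unigrams match in each direction
```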

I'd suggest looking into WhyLabs. They've got anomaly detection, lightweight SDK, complete data privacy, and ability to ingest custom metrics: https://docs.whylabs.ai/docs/start-here
Do you have an association with them or just a happy user? Either is fine of course
Former employee, yes. Stumbled across this thread, thought I'd chime in. Didn't realize how many other folks are working on tools for this problem!
You can add UpTrain to the list. We are building an open-source LLM evaluation tool with pre-built evaluations such as factual accuracy, retrieval quality, response completeness, tonality, etc., as well as an easily extensible framework that lets LLM developers define custom evaluations by chaining individual operators. Check out our demo here: https://demo.uptrain.ai/evals_demo/
Do you have any links to conversational simulation?
Here's the note I have on that: “For chatbot interfaces, emerging approach is to have another agent simulating the user (as opposed to a more classic approach based on token prediction probs on chat transcripts, what I think you're referencing). Then still use a model for grading. Only place I've seen this so far: https://github.com/Forethought-Technologies/AutoChain/blob/m... ” - AI Startup Founder
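
The shape of that approach, as a minimal sketch (this is not AutoChain's API): `simulate_conversation` and `grade_conversation` are hypothetical names, and the user agent, chatbot and grader below are trivial stubs where a real setup would make LLM calls.

```python
from typing import Callable, List, Optional, Tuple

Transcript = List[Tuple[str, str]]  # (speaker, message) pairs


def simulate_conversation(
    user_agent: Callable[[Transcript], Optional[str]],
    chatbot: Callable[[Transcript], str],
    opening: str,
    max_turns: int = 3,
) -> Transcript:
    """Alternate between the chatbot and a simulated user until the user stops."""
    transcript: Transcript = [("user", opening)]
    for _ in range(max_turns):
        transcript.append(("bot", chatbot(transcript)))
        follow_up = user_agent(transcript)
        if follow_up is None:  # simulated user is satisfied or gives up
            break
        transcript.append(("user", follow_up))
    return transcript


def grade_conversation(transcript: Transcript, grader: Callable[[Transcript], float]) -> float:
    """Hand the whole transcript to a grader (a model, in the real version)."""
    return grader(transcript)


# Stubs for illustration only.
script = iter(["and what about refunds?"])
user_stub = lambda t: next(script, None)
bot_stub = lambda t: "You said: " + t[-1][1]
grader_stub = lambda t: 1.0 if all(msg for _, msg in t) else 0.0

convo = simulate_conversation(user_stub, bot_stub, "hello", max_turns=5)
print(len(convo), grade_conversation(convo, grader_stub))
```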