Hacker News
by tikkun 996 days ago
Then there's also monitoring. My notes from the monitoring section are below:

When needed: “it goes hand-in-hand with eval, as you need to be able to turn bad prod generations into failing eval cases for eng to make pass”
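A rough sketch of what "turn bad prod generations into failing eval cases" could look like. `ProdLog`, `EvalCase`, `to_eval_cases`, and `passes` are hypothetical names, not any vendor's API, and the pass criterion (the model no longer reproduces the flagged output) is a placeholder for whatever check your eval suite really uses.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProdLog:
    """One logged production generation (hypothetical schema)."""
    prompt: str
    output: str
    flagged_bad: bool  # e.g. a thumbs-down from a user or a reviewer


@dataclass
class EvalCase:
    """A regression case derived from a flagged generation."""
    prompt: str
    bad_output: str


def to_eval_cases(logs: List[ProdLog]) -> List[EvalCase]:
    """Keep only the flagged generations and turn each into an eval case."""
    return [EvalCase(log.prompt, log.output) for log in logs if log.flagged_bad]


def passes(case: EvalCase, new_output: str) -> bool:
    """Placeholder check: the case passes once the bad output is no longer reproduced."""
    return new_output.strip() != case.bad_output.strip()
```

Each case fails against the current model by construction, which gives eng a concrete target to make pass.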

Considerations:

1) “Ability to monitor custom metrics (ROUGE [1], Coherence, etc.) and slice-and-dice data - customizability; non-intrusive logging vs proxies, VPC (enterprise-readiness). Plenty of tools w/ basic cost, latency monitoring; very few w/ enterprise-grade customizability, anomaly detection, etc.”

2) “+agent/pipeline tracing, ability to re-purpose data for fine tuning, connection to user feedback, man-in-the-middle approach (proxy) vs SDK integration (we believe SDK is superior so your monitoring vendor can go down without taking down your LLM feature)”
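To make the SDK-vs-proxy point concrete, here is a minimal sketch of SDK-style logging that stays off the request path. `MonitorClient`, `call_llm_with_monitoring`, and the `send` callable are assumptions for illustration and do not correspond to any of the vendors listed here. The idea is that events go into a bounded queue and a background thread ships them, so a vendor outage costs you logs but not the LLM feature.

```python
import queue
import threading


class MonitorClient:
    """Hypothetical SDK-style logger that never blocks the caller."""

    def __init__(self, send, max_pending=1000):
        self._send = send  # callable that ships one event to the vendor
        self._queue = queue.Queue(maxsize=max_pending)
        self.dropped = 0
        threading.Thread(target=self._worker, daemon=True).start()

    def log(self, event):
        """Enqueue an event; drop it if the queue is full rather than wait."""
        try:
            self._queue.put_nowait(event)
        except queue.Full:
            self.dropped += 1

    def _worker(self):
        while True:
            event = self._queue.get()
            try:
                self._send(event)
            except Exception:
                pass  # vendor is down: lose the log, keep the feature up
            finally:
                self._queue.task_done()

    def flush(self):
        """Wait until every queued event has been attempted."""
        self._queue.join()


def call_llm_with_monitoring(llm, monitor, prompt):
    """Call the model first, then log; logging cannot fail the request."""
    output = llm(prompt)
    monitor.log({"prompt": prompt, "output": output})
    return output
```

With a proxy, the equivalent failure sits between your app and the model, so the request itself fails when the vendor does.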

Companies for LLM monitoring: Helicone, HoneyHive, Gentrace, Humanloop, LangSmith, Pezzo

[1]: https://en.wikipedia.org/wiki/ROUGE_(metric)
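For anyone unfamiliar with ROUGE as a custom metric, here is a simplified sketch of ROUGE-1 (unigram overlap) per [1]. It assumes lowercased whitespace tokenization with no stemming, so its scores can differ from those of a proper ROUGE package; `rouge1` is just an illustrative name.

```python
from collections import Counter


def rouge1(reference: str, candidate: str) -> dict:
    """Unigram-overlap precision, recall, and F1 between two texts."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count per token (clipped overlap).
    overlap = sum((ref & cand).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

A monitoring tool with custom-metric support would let you log a score like this per generation and slice on it.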


I'd suggest looking into WhyLabs. They've got anomaly detection, a lightweight SDK, complete data privacy, and the ability to ingest custom metrics: https://docs.whylabs.ai/docs/start-here
Do you have an association with them, or are you just a happy user? Either is fine, of course.
Former employee, yes. Stumbled across this thread, thought I'd chime in. Didn't realize how many other folks are working on tools for this problem!