HN Mirror


by zwaps 557 days ago

Why is it that all of the many eval and LLM ops offerings spend seemingly all their energy on UI and playgrounds?

When it comes to tracking, tracing and versioning the entire LLM call chain, from prompts to response models, model and workflow code, and code gen/exec artifacts, it's just not there. A basic solution based on OpenTelemetry for some subset of an LLM app is easy to do; heck, even I have written one. But what use is that?

Like, how many instrumentations save prompt, IO and model settings without the orchestration code or agent/RAG flow? How does this help any production-level LLM use case?

What is this application where I am just using bare LLM prompting and RAG without any custom logic, but I need a tracing solution and a collaborative prompt playground? I have yet to see it.

Unless we can trace and version everything that actually influences the final LLM call, there is no use in a standardized framework and we need to roll a bespoke solution for every case. We try often and it always comes down to this.
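To make "version everything that influences the final LLM call" concrete, here is a minimal sketch using only the Python standard library. It is not taken from any existing framework: the function name, the parameter names, and the idea of passing the orchestration code version in as a string (a git commit SHA, say) are all assumptions for illustration.

```python
import hashlib
import json


def call_version_id(prompt_template, model_settings, code_version, retrieved_doc_ids):
    """Derive one stable ID from everything that shapes a single LLM call.

    All names here are hypothetical. `code_version` is assumed to be
    something like a git commit SHA covering the orchestration code.
    """
    record = {
        "prompt_template": prompt_template,
        "model_settings": model_settings,
        "code_version": code_version,
        "retrieved_doc_ids": list(retrieved_doc_ids),
    }
    # sort_keys gives a canonical encoding, so dict ordering cannot change the ID.
    canonical = json.dumps(record, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
```

Attaching an ID like this to every trace span means a change to the prompt, the settings, the retrieved context, or the workflow code shows up as a new version, rather than only the prompt and model settings being recorded.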

Build something that allows me to trace, evaluate and track everything, and allows for deployment in customer tenants and on-prem, and you have it.

Stop spending your time on prompt UIs and playgrounds. We code. Our LLM apps are code, lots of it! Make the foundation of your framework solid first, then worry about turning temperature knobs in a user interface.

1 comment

dsaffy 557 days ago

Yeah, we totally agree. That's why we work on the end-to-end app, not on a single prompt. You pick which parameters become knobs in the frontend. So if you have a giant app with 10 parameters (say 5 prompts, 5 numbers), great, wrap those and they become knobs on our frontend. We override them during the actual end-to-end test execution, kinda like Statsig / LaunchDarkly (only with typing).
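A rough sketch of what "wrap a parameter so it becomes an overridable, typed knob" could look like in plain Python. This is not the product's actual API, which the comment does not show; `knob`, `override`, and `answer` are hypothetical names, and the type check is a simple stand-in for "with typing".

```python
from contextlib import contextmanager
from contextvars import ContextVar

# Active overrides for the current execution context (hypothetical registry).
_overrides = ContextVar("overrides", default={})


def knob(name, default):
    """Return the active override for `name`, or `default` if none is set."""
    value = _overrides.get().get(name, default)
    # The override must have the same type as the default it replaces.
    if not isinstance(value, type(default)):
        raise TypeError(f"knob {name!r} expects {type(default).__name__}")
    return value


@contextmanager
def override(**values):
    """Apply overrides for the duration of one end-to-end run."""
    token = _overrides.set({**_overrides.get(), **values})
    try:
        yield
    finally:
        _overrides.reset(token)


def answer(question):
    # Stand-in for a larger app: two wrapped parameters, no real LLM call.
    temperature = knob("temperature", 0.2)
    system_prompt = knob("system_prompt", "You are concise.")
    return f"{system_prompt}|{temperature}|{question}"
```

Running `answer("hi")` inside `with override(temperature=0.9):` picks up the new value without touching the app code, and the default comes back once the block exits.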
