Hacker News
by lmeyerov 701 days ago
As users of otel, we are looking at reusing otel for our LLM stack, and as it is easy to instrument, don't need a new framework for that part.

However, the more interesting part is the storage: Imagine ingesting 100-page PDFs or 1M tweets, and doing many/big LLM map/reduce jobs with big (128K+) context. In observability land, we generally have small payloads, sample data, and retire data... and backends + pricing assume that. In LLMs, we might instead want to store everything: some of it hot, the rest in the DWH.

How have folks been dealing with these kinds of mismatches? E.g., Clickhouse backends for otel? Something else? Small stuff in otel and big stuff manually in a doc store / S3 JSON / Parquet?

2 comments

You're right. We faced those same issues. So we plan to send prompts and completions as log events that carry a reference to the trace/span, rather than storing them on the span itself.

The span then contains only the most important data, like the prompt template, the model that was used, token usage, etc. You can then split the metadata (spans and traces) and the large payloads (prompts + completions) into different data stores.
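To make the split concrete, here is a minimal stdlib-only sketch, not anyone's actual implementation. The function `split_llm_call`, the `llm.*` attribute names, and the dict shapes are all hypothetical. It only shows the idea of keeping small metadata on the span and putting the large text in a log event that shares the same trace/span IDs.

```python
import json
import secrets


def split_llm_call(prompt, completion, template, model, prompt_tokens, completion_tokens):
    """Return (span, log_event): small metadata on the span, large text in the log event."""
    trace_id = secrets.token_hex(16)  # 32 hex chars, same length as a W3C trace id
    span_id = secrets.token_hex(8)    # 16 hex chars, same length as a W3C span id

    # Hypothetical attribute names; only small, searchable metadata lives here.
    span = {
        "trace_id": trace_id,
        "span_id": span_id,
        "name": "llm.call",
        "attributes": {
            "llm.prompt_template": template,
            "llm.model": model,
            "llm.usage.prompt_tokens": prompt_tokens,
            "llm.usage.completion_tokens": completion_tokens,
        },
    }
    # The large payload goes to a separate record that points back at the span.
    log_event = {
        "trace_id": trace_id,
        "span_id": span_id,
        "body": {"prompt": prompt, "completion": completion},
    }
    return span, log_event


span, log_event = split_llm_call(
    prompt="Summarize: " + "lorem ipsum " * 10_000,
    completion="A short summary.",
    template="Summarize: {document}",
    model="example-model",
    prompt_tokens=25_000,
    completion_tokens=5,
)
span_bytes = len(json.dumps(span))
log_bytes = len(json.dumps(log_event))
print(span_bytes, log_bytes)
```

The span record stays small regardless of prompt size, so it can go to a store priced for small payloads, while the log event can go to cheaper bulk storage and be joined back by `trace_id`/`span_id`.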

At Portkey, this is a problem we deal with quite a bit. It's also the reason that Datadog and the traditional observability vendors did not work for LLM use cases: they're not built to handle large volumes of data.

We've done this through a careful combination of Clickhouse + MinIO: Clickhouse for fast retrieval of log items, plus selective retrieval of payloads from the MinIO buckets.

Cost becomes a very big factor when managing, filtering and searching through TBs of data even for fairly small use cases.

One thing we lost in the process is full-text search over the request & response pairs, and while we try to intelligently add metadata to requests to make searching easier, it isn't the complete experience yet. It's still a WIP problem to solve and maybe the last missing piece here. Any suggestions?

Clickhouse has text + vector indexes, so that may be native, though we have never used them, and I find vector indexes tricky to scale with other DBs. Text, or neither, may be enough in practice though, as we mostly only care about searching on metadata dimensions like task.

We are thinking about sampled hot data for ops staff in otel DB+UIs, and long-term full data in S3/Clickhouse for custom tooling. It'd be cool if we could send historical otel sessions from Clickhouse to Grafana etc. on demand, but that's likely a bridge too far...
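One way to sketch the "sampled hot, full archive" routing is below. It assumes deterministic sampling keyed on the trace ID, which is my assumption and not something stated above. `keep_hot` and `route` are hypothetical names, and the two lists stand in for the otel backend and the S3/Clickhouse archive.

```python
import hashlib


def keep_hot(trace_id: str, sample_rate: float) -> bool:
    """Deterministically decide whether a trace also goes to the hot store."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform integer in [0, 2**64)
    return bucket < sample_rate * 2**64


def route(records, sample_rate=0.1):
    """Send every record to the archive and a sampled subset to the hot store."""
    hot, archive = [], []
    for record in records:
        archive.append(record)  # long-term store keeps everything
        if keep_hot(record["trace_id"], sample_rate):
            hot.append(record)  # ops-facing store keeps a sample
    return hot, archive


records = [{"trace_id": f"{i:032x}", "span": f"step-{i % 3}"} for i in range(1000)]
hot, archive = route(records, sample_rate=0.1)
print(len(hot), len(archive))
```

Hashing the trace ID instead of drawing a random number per record means every span of a given session gets the same decision, so sampled sessions stay complete in the hot store.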

I think you can (pretty) easily set this up with an otel collector and something that replays data from S3 - there's a native implementation that converts otel to Clickhouse.

Our scenario would be more like using Clickhouse / a DWH for session cohort/workflow filtering and then populating otel tools for viz goodies. Interestingly, to your point, the otel Python exporter libs are pretty simple, so SQL results -> otel spans -> Grafana temp storage should be simple!
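A rough sketch of the "SQL results -> otel spans" step follows, with several assumptions: `sqlite3` stands in for Clickhouse / the DWH, the `spans` table and its columns are hypothetical, and the payload is built by hand in the OTLP/JSON shape instead of going through the otel Python exporter libs. The actual send to a collector is left out.

```python
import json
import sqlite3


def rows_to_otlp(rows, service_name="llm-replay"):
    """Build an OTLP/JSON trace payload from
    (trace_id, span_id, name, start_ns, end_ns, task) rows."""
    spans = []
    for trace_id, span_id, name, start_ns, end_ns, task in rows:
        spans.append({
            "traceId": trace_id,                  # hex string in OTLP/JSON
            "spanId": span_id,
            "name": name,
            "kind": 1,                            # SPAN_KIND_INTERNAL
            "startTimeUnixNano": str(start_ns),   # 64-bit ints are JSON strings
            "endTimeUnixNano": str(end_ns),
            "attributes": [{"key": "task", "value": {"stringValue": task}}],
        })
    return {
        "resourceSpans": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}}
            ]},
            "scopeSpans": [{"scope": {"name": "sql-replay"}, "spans": spans}],
        }]
    }


# sqlite3 stands in for the warehouse; the schema is made up for this sketch.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE spans (trace_id TEXT, span_id TEXT, name TEXT, "
    "start_ns INTEGER, end_ns INTEGER, task TEXT)"
)
db.executemany(
    "INSERT INTO spans VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("0" * 31 + "1", "0" * 15 + "1", "map", 1_000, 2_000, "summarize"),
        ("0" * 31 + "1", "0" * 15 + "2", "reduce", 2_000, 5_000, "summarize"),
        ("0" * 31 + "2", "0" * 15 + "3", "map", 1_500, 1_800, "classify"),
    ],
)

# Cohort filtering happens in SQL; only the matching sessions become spans.
rows = db.execute(
    "SELECT trace_id, span_id, name, start_ns, end_ns, task "
    "FROM spans WHERE task = ? ORDER BY start_ns",
    ("summarize",),
).fetchall()
payload = rows_to_otlp(rows)
print(json.dumps(payload)[:120])
```

A payload like this could then be POSTed to a collector's OTLP/HTTP traces endpoint (commonly `/v1/traces` on port 4318), assuming the downstream backend accepts spans with historical timestamps.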