Hacker News new | ask | show | jobs
by zoogeny 1000 days ago
One thing about logging and tracing is the inevitable cost (in real money).

I love observability probably more than most. And my initial reaction to this article is the obvious: why not both?

In fact, I tend to think more in terms of "events" when writing both logs and tracing code. How that event is notified, stored, transmitted, etc. is in some ways divorced from the activity. I don't care if it is going to stdout, or over udp to an aggregator, or turning into trace statements, or ending up in Kafka, etc.

But inevitably I bump up against cost. For even medium sized systems, the amount of data I would like to track gets quite expensive. For example, many tracing services charge for the tags you add to traces. So doing `trace.String("key", value)` becomes something I think about from a cost perspective. I worked at a place that had a $250k/year New Relic bill and we were avoiding any kind of custom attributes. Just getting APM metrics for servers and databases was enough to get to that cost.

Logs are cheap, easy, reliable and don't lock me in to an expensive service to start. I mean, maybe you end up integrating splunk or perhaps self-hosting kibana, but you can get 90% of the benefits just by dumping the logs into Cloudwatch or even S3 for a much cheaper price.

10 comments

FWIW part of the reason you're seeing that is, at least traditionally, APM companies rebranding as Observability companies stuffed trace data into metrics data stores, which becomes prohibitively expensive to query with custom tags/attributes/fields. Newer tools/companies have a different approach that makes cost far more predictable and generally lower.

Luckily, some of the larger incumbents are also moving away from this model, especially as OpenTelemetry is making tracing more widespread as a baseline of sorts for data. And you can definitely bet they're hearing about it from their customers right now, and they want to keep their customers.

Cost is still a concern but it's getting addressed as well. Right now every vendor has different approaches (e.g., the one I work for has a robust sampling proxy you can use), but that too is going the way of standardization. OTel is defining how to propagate sampling metadata in signals so that downstream tools can use the metadata about population representativeness to show accurate counts for things and so on.

> Newer tools/companies have a different approach that makes cost far more predictable and generally lower.

What newer tools/companies are in this category? Any that you recommend?

I haven't used anything else, but I'll gladly shill for https://honeycomb.io.
I think we fit in that bucket [1] - open source, self-hostable, based on OpenTelemetry and backed by Clickhouse DB (columnar, not time-series).

Clickhouse gives users much greater flexibility in tradeoffs than either a time-series or inverted-index based store could offer (along with S3 support). There's nothing like a system that can balance high performance AND (usable) high cardinality.

[1] https://github.com/hyperdxio/hyperdx

disclaimer (in case anyone just skimmed): I'm one of the authors of HyperDX

Companies like https://signoz.io/ are Opentelemetry native and have very transparent approach to predictable pricing. You can self host easily as well.
Any of the Clickhouse-based Otel stores can dump the traces to s3 for long-term storage, and can be self-hosted. I know the following use CH: https://uptrace.dev/ https://signoz.io/ https://github.com/hyperdxio/hyperdx
I have made use of tracing, metrics, and logging all together and find each of them have its own place, as well as synergies of being able to work with all three together.

Cost is a real issue, and not just in terms of how much the vendor costs you. When tracing becomes a noticeable fraction of CPU or memory usage relative to the application, it's time to rethink doing 100% sampling. In practice, if you are sampling thousands of requests per second, you're very unlikely to actually look through each one of those thousands (thousands of req/s may not be a lot for some sites, but it is already exceeding human-scale without tooling). In order to keep accurate, useful statistics with sampling, you end up using metrics to store trace metrics prior to sampling.

> In fact, I tend to think more in terms of "events" when writing both logs and tracing code.

They are events[1]. For my text editor, KeenWrite, events can be logged either to the console when run from the command-line or displayed in a dialog when running in GUI mode. By changing "logger.log()" statements to "event.publish()" statements, a number of practical benefits are realized, including:

* Decoupled logging implementation from the system (swap one line of code to change loggers).

* Publish events on a message bus (e.g., D-Bus) to allow extending system functionality without modifying the existing code base.

* Standard logging format, which can be machine parsed, to help trace in-field production problems.

* Ability to assign unique identifiers to each event, allowing for publication of problem/solution documentation based on those IDs (possibly even seeding LLMs these days).

[1]: https://dave.autonoma.ca/blog/2022/01/08/logging-code-smell/

But events that another system relies upon are now an API. Be careful not to lock together things that are only superficially similar, as it affects your ability to change them independently.
Architecturally, the decoupling works as follows:

    Event -> Bus -> UI Subscriber -> Dialog (table)
    Event -> Bus -> Log Subscriber -> Console (text)
    Event -> Bus -> D-Bus Subscriber -> Relay -> D-Bus -> Publish (TCP/IP)
With D-Bus, published messages are versioned, allowing for API changes without breaking third-party consumers. The D-Bus Subscriber provides a layer of isolation between the application and the published messages so that the two can vary independently.
Observability costs feel high when everything’s working fine. When something snaps and everything is down and you need to know why in a hurry… those observability premiums you’ve been paying all along can pay off fast.
As other posters have mentioned, the incument companies rebranding to Observability definitely are expensive, because they are charging in the same way as they do for logs and/or metrics: per entry and per unique dimension (metrics especially).

Honeycomb at least charges per event, which in this case means per span - however they don't charge per span attribute, and each span can be pretty large (100kb / 2000 attributes).

I run all my personal services in their free tier, which has plenty of capacity, and that's before I do any sampling.

How does one break into the industry though? I worked in a project tangentially related and the problem is that sales were done by corporate sales man rather than on technicality. The companies buying the product didn't care because the people involved were making "deals". The company selling the product didn't care about making the product better because it was selling and having high AWS bills sounds like they were doing something (even though they were burning money).
You don't have to keep traces for long though.

Log for long term, traces for short debut and analisys is a fine compromise.

> I mean, maybe you end up integrating splunk or perhaps self-hosting kibana

I think this is the issue. Both Splunk and OpenSearch (even self-hosted OpenSearch) get really pricy as well especially with large volumes of log data. Cloudwatch can also get ludicrously expensive. They charge something like $0.50 per GB (!) and another $0.03 per GB to store. I've seen situations at a previous employer where someone accidentally deployed a lambda function with debug logging and ran up a few thousand $$ in Cloudwatch bills overnight.

You should look at Coralogix (disclaimer: I work there). We've built a platform that allows you to store your observability data in S3 and query it through our infrastructure. It can be dramatically more cost-effective than other providers in this space.

Why would you ever have lockin if you're using open telemetry?