by rekwah 841 days ago

> just put it there, it might be useful later

> Also note that we have never mentioned anything about cardinality. Because it doesn’t matter - any field can be of any cardinality. Scuba works with raw events and doesn’t pre-aggregate anything, and so cardinality is not an issue.

This is how we end up with very large, very expensive data swamps.

_visgean 841 days ago

That depends on the sampling rate, no? I would much rather have a rich log record sampled at 1% than more records that don't contain enough info to debug.

kiitos 841 days ago

It is a tragedy of the current generation of observability systems that they have inculcated the notion that telemetry data should be sampled. Absolute nonsense.

growse 841 days ago

The people feeling the pain of (and paying for) the expensive data swamp are often not the same people who are yolo'ing the sample rate to 100% in their apps, because why wouldn't you want to store every event?

Put another way, you're in charge of a large telemetry event sink. How do you incentivise the correct sampling behaviour by your users?

Spivak 841 days ago

Don't let the user pick the sampling rate. In Honeycomb land this is called the EMA Dynamic Sampler.

https://docs.honeycomb.io/manage-data-volume/refinery/sampli...
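
To make "the sink picks the rate" concrete, here is a rough sketch of the general idea: keep an exponential moving average (EMA) of how many events each key produces per interval, and sample noisy keys harder than quiet ones. This is not Honeycomb's implementation or configuration; the class name, the `target_per_interval` knob, and the rate formula are all assumptions made for illustration.

```python
from collections import defaultdict


class EmaRateSampler:
    """Hypothetical sketch: per-key sample rates from an EMA of traffic."""

    def __init__(self, target_per_interval: float, weight: float = 0.5):
        self.target = target_per_interval  # assumed knob: events to keep per key per interval
        self.weight = weight               # how strongly the newest interval moves the average
        self.ema = defaultdict(float)      # smoothed per-key event counts
        self.current = defaultdict(int)    # counts for the interval in progress

    def observe(self, key: str) -> None:
        """Count one incoming event for this key."""
        self.current[key] += 1

    def end_interval(self) -> None:
        """Fold the finished interval into the moving average and reset the counters."""
        for key in set(self.ema) | set(self.current):
            count = self.current.get(key, 0)
            self.ema[key] = self.weight * count + (1 - self.weight) * self.ema[key]
        self.current.clear()

    def sample_rate(self, key: str) -> int:
        """Return N, meaning 'keep 1 in N events' for this key (never below 1)."""
        return max(1, round(self.ema[key] / self.target))


sampler = EmaRateSampler(target_per_interval=10)
for _ in range(1000):
    sampler.observe("GET /health")      # noisy, low-value traffic
for _ in range(5):
    sampler.observe("POST /checkout")   # rare, high-value traffic
sampler.end_interval()

noisy_rate = sampler.sample_rate("GET /health")
rare_rate = sampler.sample_rate("POST /checkout")
```

With these made-up numbers the noisy key is sampled heavily while the rare key is kept in full, and no application owner had to choose a rate.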

kiitos 841 days ago

You should never need to sample telemetry data.

gtirloni 841 days ago

Sampling metrics, yes, but sampling logs? When an end-to-end transaction for a very important task breaks, do I get *some* breadcrumbs to debug it?

_visgean 839 days ago

I have used that approach before with Sentry. It was a non-issue. It depends on the nature of the project, of course; we had a system that ran every second, so if it failed it generated a lot of data.

goosejuice 841 days ago

I agree. Sampling logs sounds dangerous. Obviously every system is different.

At least in GCP you can apply an exclusion filter to prevent ingestion and set different retention periods on log buckets. This can help control costs without missing important entries.

isburmistrov 841 days ago

Sampling can be smart, e.g. based on some field all events have (can be called traceId, haha).
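
A minimal sketch of what that could look like, assuming every event carries a string `traceId` field: hash the ID and keep the trace only when the hash lands in a chosen bucket. The function name `keep_event` and the 1-in-N convention are illustrative assumptions, not taken from any particular tool.

```python
import hashlib


def keep_event(trace_id: str, sample_rate: int) -> bool:
    """Keep roughly 1 in `sample_rate` traces, deciding per trace rather than per event."""
    if sample_rate <= 1:
        return True
    # A stable hash (unlike Python's built-in hash()) gives the same answer
    # in every process, so all services agree on which traces to keep.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket % sample_rate == 0


events = [
    {"traceId": "trace-a", "msg": "request received"},
    {"traceId": "trace-a", "msg": "db query failed"},
    {"traceId": "trace-b", "msg": "request received"},
]
kept = [e for e in events if keep_event(e["traceId"], sample_rate=100)]
```

Because the decision depends only on the trace ID, a kept trace keeps all of its events, so the breadcrumbs for a sampled transaction stay complete.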
