| HN Mirror


	by iampims 600 days ago
	Or sampling :)

1 comments

craigching 600 days ago

Sampling is lossy though

iampims 600 days ago

Lossy and simpler.

IME, sampling is simpler to reason about, and with the sampling rate included in each message, deriving metrics from logs works pretty well.
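
To make the "sampling rate as part of the message" idea concrete, here is a minimal sketch. The field name `sample_rate` and both function names are assumptions for illustration, not taken from any particular logging library. Each kept event carries the rate it was sampled at, so a count can be estimated later by weighting each event by 1/rate.

```python
import random


def sample_events(events, rate, rng=None):
    """Keep each event with probability `rate`, stamping the rate on kept events.

    `sample_rate` is a hypothetical field name chosen for this sketch.
    """
    rng = rng or random.Random()
    kept = []
    for event in events:
        if rng.random() < rate:
            kept.append({**event, "sample_rate": rate})
    return kept


def estimate_count(sampled_events):
    """Estimate the pre-sampling count: each kept event stands for 1/rate events."""
    return sum(1.0 / event["sample_rate"] for event in sampled_events)
```

Because the rate travels with each event, events sampled at different rates can be mixed in the same estimate. The result is still only an estimate, which is the "lossy" part.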

The example in the article is a little contrived. Health checks often originate from multiple hosts, and/or the logs contain the remote address and port, which makes each log message effectively unique. So sure, one could parse the remote address into remote_address=192.168.12.23 remote_port=64780 and then drop the port in the aggregation, but is the juice worth the squeeze?
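
For reference, the "parse it and drop the port" step might look roughly like this. It is a sketch that assumes the address has already been split into `remote_address=` and `remote_port=` key/value fields as in the example above; the regex and function names are made up for illustration.

```python
import re
from collections import Counter

# Matches the already-parsed port field, e.g. " remote_port=64780".
PORT_FIELD = re.compile(r"\s*\bremote_port=\d+")


def aggregation_key(line):
    """Return the log line with the ephemeral port field removed."""
    return PORT_FIELD.sub("", line).strip()


def aggregate(lines):
    """Count log lines that are identical once the port is ignored."""
    return Counter(aggregation_key(line) for line in lines)
```

Lines that differ only in the port collapse into one key. Lines from different hosts still stay separate, so whether this buys much depends on how many hosts send health checks.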

kiitos 600 days ago

If a service emits a log event, then that log event should be visible in your logging system. Basic stuff. Sampling fails this table-stakes requirement.

eru 600 days ago

Typically, you store your most recent logs in full, and you can move to sampling for older logs (if you don't want to delete them outright).

kiitos 600 days ago

It's reasonable to drop logs beyond some window of time -- a year, say -- but I'm not sure why you'd ever sample log events. Metric samples, maybe! Log data, no point.

But, in general, I think we agree -- all good!

eru 599 days ago

> It's reasonable to drop logs beyond some window of time -- a year, say [...]

That's reasonable in a reasonable environment. Alas, I've worked in large legacy enterprises (banks, etc.) where storage space is at much more of a premium, for reasons.

You are right that naive sampling works better for metrics.

For logs you can still sample, but in a saner way: instead of dropping each log line with an independent probability, you'll want correlation. E.g., for each log file and each hour, flip a single weighted coin to decide whether to keep the whole thing.
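
A minimal sketch of that correlated sampling, assuming each record is a dict with hypothetical `file` and `hour` fields (the function name is also made up). One coin flip is made per (file, hour) pair and remembered, so every line in that pair is kept or dropped together.

```python
import random


def sample_by_file_and_hour(records, keep_probability, rng=None):
    """Keep or drop whole (file, hour) groups with one weighted coin flip each."""
    rng = rng or random.Random()
    decisions = {}  # (file, hour) -> True if the whole group is kept
    kept = []
    for record in records:
        key = (record["file"], record["hour"])
        if key not in decisions:
            # Flip the weighted coin once per group, then reuse the outcome.
            decisions[key] = rng.random() < keep_probability
        if decisions[key]:
            kept.append(record)
    return kept
```

What survives is then a set of complete hour-long slices rather than a scattering of isolated lines, so the context around any surviving event is still intact.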
