by stackskipton 337 days ago
As SRE/DevOps/Ops whatever, I'm screaming.

Metrics should be emitted in a separate stream and never via logs, outside of corner cases. Logs should be used to determine WHY the system is having issues, never IS the system having issues.

Log alerting is a fool's errand that looks like a great idea at the start but quickly becomes a sand trap that will drive future people crazy and, at scale, will overwhelm systems.

Why is log alerting a bad idea?

Every log becomes a metric point that must be dealt with. Therefore, the logging system must be kept operational and error free. However, due to the other problems below, this system quickly becomes a beast of its own.

Logs are generally much bigger than a KV of <Metric> <Value>, so there ends up being a ton of filtering going on in the logging system, adding to the load.

The logging system probably does not understand rates, so you end up writing gnarly queries to answer "Is this my first unhandled exception in 10m or my 50th in 10m?" The query in Prometheus is much, much simpler.
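To make the rate point concrete, here is a rough sketch in plain Python of what a counter-based check boils down to. The scrape data, the `increase` helper, and the threshold of 50 are all made up for illustration; in Prometheus this would be roughly an `increase(...[10m])` expression over a counter, and unlike Prometheus this sketch does not handle counter resets.

```python
from bisect import bisect_left

def increase(samples, now, window):
    """Growth of a monotonically increasing counter over the last `window` seconds.

    `samples` is a time-sorted list of (timestamp, counter_value) pairs.
    Counter resets are not handled in this sketch.
    """
    times = [t for t, _ in samples]
    # Index of the first sample at or after the start of the window.
    start = bisect_left(times, now - window)
    in_window = samples[start:]
    if len(in_window) < 2:
        return 0  # not enough data points to compute a difference
    return in_window[-1][1] - in_window[0][1]

# Hypothetical scrapes of an "unhandled exceptions" counter, one per minute.
values = [0, 0, 1, 1, 3, 10, 22, 35, 48, 60, 71]
scrapes = [(60 * i, v) for i, v in enumerate(values)]

errors_10m = increase(scrapes, now=600, window=600)
should_alert = errors_10m >= 50  # made-up threshold: 1st vs. 50th is just a comparison
```

The difference between "first exception" and "50th exception" is a single subtraction and comparison, with no log parsing involved.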

Each language's logging library handles things in a different way, so the organization must be on point to either A) keep the log format the same across all the different languages, or B) teach the logging system how to manipulate each log into a format that can be handled by the alerting system. Obviously A causes massive developer friction and B causes massive Ops friction.

Finally, I find people doing logging tend not to handle exceptions as well, because they can just trust the logging system to alert them on a specific problem and deal with it manually.

So for the future Ops person who has to deal with your code, I'm begging you, import prometheus_client.
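For anyone who hasn't done that before, a minimal sketch of what it can look like is below. The metric name `app_unhandled_exceptions`, the `handler` label, and the `run_handler` wrapper are assumptions made up for this example, not anything from a particular codebase.

```python
from prometheus_client import Counter, REGISTRY, generate_latest

# Hypothetical metric and label names, picked for this example.
UNHANDLED_EXCEPTIONS = Counter(
    "app_unhandled_exceptions",
    "Unhandled exceptions, by handler",
    ["handler"],
)

def run_handler(name, func):
    """Run `func`, counting any exception that escapes it before re-raising."""
    try:
        return func()
    except Exception:
        UNHANDLED_EXCEPTIONS.labels(handler=name).inc()
        raise

def flaky():
    raise ValueError("boom")

for _ in range(3):
    try:
        run_handler("checkout", flaky)
    except ValueError:
        pass  # a real app would log the details here; the counter carries the signal

# Counters are exposed with a `_total` suffix.
count = REGISTRY.get_sample_value(
    "app_unhandled_exceptions_total", {"handler": "checkout"}
)
exposition = generate_latest().decode()  # text a Prometheus scrape would receive
```

In a real service you would typically call `start_http_server(<port>)` once at startup so Prometheus can scrape the process, then alert on the counter's increase over a window instead of on log lines.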


I've noticed that for some reason developers really like using logs in place of actual metrics. We use Datadog, and multiple times now I have seen devs add additional logging to an application just so they can then create a monitor that counts those log events. I think it's a path of least resistance thing; emitting logs is very easy, and counting them is also very easy. Reporting actual metrics isn't really difficult either, but unless you're already familiar with the system it's more effort to determine how to do it than just emitting a log line, so yeah.

Because when the application is breaking it's good to know why! Logs can be just as ephemeral as metrics -- in many cases, even more so. They're not even mutually exclusive.

Where exactly does this anti-logs sentiment come from? Is it because tools like Datadog can be lackluster for reading logs across bunches of hosts?

For me, I don't use Datadog, so it's not that $ParticularTool does not work with logs. It's all the stuff I put in my original post: it's a ton of samples, filtering puts heavy strain on the systems, and it's extremely brittle IME.

If you have good metrics, you can generally get much further without even aggregating logs, beyond tossing everything into STDOUT and checking on it when you have alerts.

GP is not anti-log.

It's claiming that you should output working logs, metrics, and failure logs into different streams.

My experience is that metrics may tell you something is wrong, but logs are required to tell you what went wrong and why.

A simple fixed-length rolling buffer can get you pretty far for logging, but it isn't something you necessarily want to get off-device except when something bad has happened.
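As a rough sketch of that rolling-buffer idea using only the standard library: the `RollingBufferHandler` class, the `device` logger name, and the capacity of 3 are all assumptions for illustration, and a real device would pick its own capacity and its own way of shipping the dump.

```python
import logging
from collections import deque

class RollingBufferHandler(logging.Handler):
    """Keep only the most recent `capacity` formatted log lines in memory."""

    def __init__(self, capacity):
        super().__init__()
        self.buffer = deque(maxlen=capacity)  # oldest entries fall off automatically

    def emit(self, record):
        self.buffer.append(self.format(record))

    def dump(self):
        """Return buffered lines, oldest first, e.g. to ship off-device after a failure."""
        return list(self.buffer)

log = logging.getLogger("device")  # hypothetical logger name
log.setLevel(logging.DEBUG)
log.propagate = False  # keep this example from also writing to the root logger
handler = RollingBufferHandler(capacity=3)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
log.addHandler(handler)

for i in range(5):
    log.debug("step %d", i)
log.error("sensor read failed")

# Only the last three lines survive; send these along only when something went wrong.
recent = handler.dump()
```

Nothing leaves the process until `dump()` is called, so the steady-state cost is a small fixed amount of memory.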

Logs are for people, and trying to make them for computers is hard.

Well, if you have low enough volume, you have to implement logs anyway and don't have a reason to optimize their volume.

I imagine many people learn in an environment like this and get thrown into a high-volume one without a chance to adapt.