| As SRE/DevOps/Ops whatever, I'm screaming. Metrics should be emitted in separate stream and never by logs outside corner cases. Logs should be used to determine WHY the system is having issues but never IS the system having issues. Log alerting is a fools errand that looks like a great idea at start but quickly becomes a sand trap that will drive future people crazy and at scale, will overwhelm systems. Why is log alerting bad idea? Every log becomes a metric point that must be dealt with. Therefore, the logging system must be kept operational and error free. However, due to other problems below, this system quickly becomes a beast of it's own. Logs are generally much bigger then KV of <Metric> <Value> so there ends up being a ton of filtering going on in logging system, adding to the load. Logging system probably does not understand rates so you end up writing gnarly queries to be like "Is this first unhandled exception?" in 10m or my 50th in 10m. Query in Prometheus is much much simpler. Each language logging library handles things in different way so organization must be on point to either A) Keep log format the same between all different languages. B) Teach the logging system how to manipulate each log into format that can be handled by alerting system. Obviously A causes massive developer friction and B causes massive Ops friction. Finally, I find people doing logging tend not handle exceptions as well because they can just trust logging system to alert them on specific problem and deal with it manually. So for future Ops person who has to deal with your code, I'm begging you, import prometheus_client. |