| HN Mirror

by jldugger 806 days ago

Not manually; it just requires you to be able to group them along a dimension of interest. For example, if I get a page from us-east-1a, I can compare all the logs from that zone against those from us-east-1b. Or you can compare all the logs from the hour after the incident started against the corresponding hour a day ago (or a week ago).

I pulled this technique from canary analysis and applied it to production outage analysis. In canary, you have a guaranteed random stable population that lets you perform accurate comparisons. Elsewhere, we can try to make that assumption but it might break down. For example, regional holidays can radically alter customer behavior over time or between regions. So it's not perfect but it's often good enough to provide me insights while on call.

And it requires advanced log queries to perform all of the filtering, grouping, counting, and scoring.
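
To make the filter/group/count/score steps concrete, here is a rough Python sketch of the idea. It is not the actual query I run: the digit-masking regex, the add-one smoothing, the log-ratio score, and the names `template` and `score_groups` are all assumptions chosen for illustration, and it assumes you have already split the raw log lines into an "incident" group and a "baseline" group.

```python
import math
import re
from collections import Counter

# Assumption: masking digit runs is enough to collapse similar lines
# into one "template". Real logs usually need smarter normalization.
_DIGITS = re.compile(r"\d+")


def template(line: str) -> str:
    """Collapse a raw log line into a rough message template."""
    return _DIGITS.sub("<*>", line.strip())


def score_groups(incident, baseline, top=10):
    """Rank templates by how differently they occur in the two groups.

    Returns a list of (log2_ratio, incident_count, baseline_count, template)
    tuples, sorted by the absolute value of the log ratio.
    """
    inc = Counter(template(line) for line in incident)
    base = Counter(template(line) for line in baseline)
    templates = set(inc) | set(base)
    inc_total = sum(inc.values())
    base_total = sum(base.values())

    scored = []
    for t in templates:
        # Add-one smoothing so a template missing from one group
        # does not cause a division by zero.
        p_inc = (inc[t] + 1) / (inc_total + len(templates))
        p_base = (base[t] + 1) / (base_total + len(templates))
        scored.append((math.log2(p_inc / p_base), inc[t], base[t], t))

    scored.sort(key=lambda row: abs(row[0]), reverse=True)
    return scored[:top]
```

The two inputs can be any pair of groups from above: lines from us-east-1a versus us-east-1b, or the first hour of the incident versus the corresponding hour a day earlier. Templates with a large positive score are over-represented in the incident group, and large negative scores mark messages that went quiet.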

1 comments

dotancohen 805 days ago

OK, I'm starting to see where you're going with this - I also compare incident-affected logs with pre-incident logs, or HTTP requests that I want to debug with similar requests that are known good.

What tools are you using? For me it's often just grep and awk with temp files, maybe a touch of python occasionally.

jldugger 804 days ago

Currently, Splunk. But that was expensive before the acquisition, so I'm sure someone will come along and suggest I replace it with OpenSearch or something.
