by jldugger
806 days ago
Not manually. It just requires that you can group the logs along a dimension of interest. For example, if I get a page from us-east-1a, I can compare all the logs from that zone against those from us-east-1b. Or you can compare the logs from the hour after the incident started against the same hour a day (or a week) earlier. I pulled this technique from canary analysis and applied it to production outage analysis. In canary analysis, you have a guaranteed random, stable population that lets you make accurate comparisons. Elsewhere, we can try to make that assumption, but it might break down: regional holidays, for example, can radically alter customer behavior over time or between regions. So it's not perfect, but it's often good enough to give me insights while on call. It also requires advanced log queries to do all the filtering, grouping, counting, and scoring.
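To make the filter/group/count/score idea concrete, here is a minimal Python sketch. It is only an illustration under assumptions: it assumes each log record has already been parsed into a dict with hypothetical `az` and `msg` fields, and the smoothed log-ratio score is one plausible scoring choice, not necessarily the one used in the queries described above.

```python
import math
from collections import Counter


def score_messages(suspect, baseline, smoothing=1.0):
    """Rank messages by how over-represented they are in `suspect`
    relative to `baseline`, using a smoothed log ratio of rates.

    Both arguments are lists of message strings (ideally normalized
    templates, so variable parts such as IDs do not split the counts).
    """
    s_counts = Counter(suspect)
    b_counts = Counter(baseline)
    vocab = set(s_counts) | set(b_counts)
    n = len(vocab)
    scores = {}
    for msg in vocab:
        # Additive smoothing keeps the ratio finite when a message
        # appears in only one of the two groups.
        s_rate = (s_counts[msg] + smoothing) / (len(suspect) + smoothing * n)
        b_rate = (b_counts[msg] + smoothing) / (len(baseline) + smoothing * n)
        scores[msg] = math.log(s_rate / b_rate)
    # Highest score first: most over-represented in the suspect group.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def split_by(records, field, suspect_value, baseline_value):
    """Split parsed records into (suspect, baseline) message lists
    according to the value of one grouping field (hypothetical schema)."""
    suspect = [r["msg"] for r in records if r[field] == suspect_value]
    baseline = [r["msg"] for r in records if r[field] == baseline_value]
    return suspect, baseline


if __name__ == "__main__":
    # Made-up records purely for demonstration.
    records = (
        [{"az": "us-east-1a", "msg": "upstream timeout"}] * 8
        + [{"az": "us-east-1a", "msg": "request ok"}] * 2
        + [{"az": "us-east-1b", "msg": "request ok"}] * 9
        + [{"az": "us-east-1b", "msg": "upstream timeout"}] * 1
    )
    suspect, baseline = split_by(records, "az", "us-east-1a", "us-east-1b")
    for msg, score in score_messages(suspect, baseline):
        print(f"{score:+.2f}  {msg}")
```

The same `score_messages` call works for the time-based comparison: pass the messages from the incident hour as `suspect` and those from the matching hour a day or a week earlier as `baseline`. The caveat about non-random populations still applies to whatever the score surfaces.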
What tools are you using? For me it's often just grep and awk with temp files, and occasionally a touch of Python.