by jldugger 807 days ago
The simplest technique, and the one I currently use, is just "(n+bad)/(n+good)" where n is basically the strength of a prior belief that bad/good = 1. At some level I think this might replicate TF-IDF[1] but I haven't sat down to prove it or find where they diverge.

[1]: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
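For anyone who wants to play with the formula, here is a minimal Python sketch of the score as described above. The function name `smoothed_ratio` and the default `n=10.0` are illustrative assumptions, not anything from the comment; only the expression `(n+bad)/(n+good)` comes from it.

```python
def smoothed_ratio(bad: int, good: int, n: float = 10.0) -> float:
    """Return (n + bad) / (n + good).

    n is the strength of the prior belief that bad/good = 1:
    with no observations the score is exactly 1, and a larger n
    needs more evidence to move the score away from 1.
    (Name and default value are hypothetical.)
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return (n + bad) / (n + good)


if __name__ == "__main__":
    # A line seen 90 times in the bad group and never in the good group
    # scores well above 1; a line seen equally in both stays at 1.
    print(smoothed_ratio(90, 0))
    print(smoothed_ratio(40, 40))
```

Scores well above 1 point at lines over-represented in the bad group, and scores well below 1 at lines that went missing from it.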


But this still requires you to classify each line manually to determine bad or good, no?
Not manually; it just requires that you can group them along a dimension of interest. For example, if I get a page from us-east-1a, I can compare all the logs from that zone against those from us-east-1b. Or you can compare all the logs from the hour after the incident started against the logs from the same hour a day ago (or a week ago).

I pulled this technique from canary analysis and applied it to production outage analysis. In canary, you have a guaranteed random stable population that lets you perform accurate comparisons. Elsewhere, we can try to make that assumption but it might break down. For example, regional holidays can radically alter customer behavior over time or between regions. So it's not perfect but it's often good enough to provide me insights while on call.

It also requires fairly advanced log queries to do all of this filtering, grouping, counting, and scoring.
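Outside a log platform, the same group, count, and score steps can be roughed out in a few lines of Python. This is only a sketch under assumptions: the helper names `template` and `rank_lines` are hypothetical, and collapsing numbers and hex ids into placeholders is my own guess at how to make similar lines count together, not something described above. The two input lists stand in for whatever split you chose (one zone vs. another, or incident hour vs. the same hour a day earlier).

```python
import re
from collections import Counter


def template(line: str) -> str:
    """Collapse variable parts so similar lines share one key.

    Assumption: hex ids and decimal numbers are the only variable parts.
    """
    line = re.sub(r"0x[0-9a-fA-F]+", "<hex>", line)
    return re.sub(r"\d+", "<n>", line)


def rank_lines(bad_lines, good_lines, n=5.0):
    """Count templates per group, then score each as (n + bad) / (n + good)."""
    bad = Counter(template(line) for line in bad_lines)
    good = Counter(template(line) for line in good_lines)
    # Counter returns 0 for templates missing from one group.
    scores = {t: (n + bad[t]) / (n + good[t]) for t in bad.keys() | good.keys()}
    # Highest score first: most over-represented in the bad group.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    incident = ["timeout after 30 ms", "timeout after 45 ms", "GET /health 200"]
    baseline = ["GET /health 200", "GET /health 200"]
    for tmpl, score in rank_lines(incident, baseline, n=1.0):
        print(f"{score:.2f}  {tmpl}")
```

The raw counts are only comparable when the two groups cover similar volumes of traffic, which is one reason to compare equal-length time windows.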

OK, I'm starting to see where you're going with this - I also compare incident-affected logs with pre-incident logs, or HTTP requests that I want to debug with similar requests that are known good.

What tools are you using? For me it's often just grep and awk with temp files, maybe a touch of python occasionally.

Currently, Splunk. But that was expensive before the acquisition, so I'm sure someone will come along and suggest I replace it with OpenSearch or something.