| HN Mirror


	by jldugger 807 days ago
	The simplest technique, and the one I currently use, is just "(n+bad)/(n+good)", where bad and good are how often a log line shows up in each group and n is basically the strength of a prior belief that bad/good = 1. At some level I think this might replicate TF-IDF[1], but I haven't sat down to prove it or find where they diverge. [1]: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
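
As a minimal sketch of that score in Python: the function name `smoothed_ratio` and the default `n=10` are illustrative assumptions, not something from the comment. It shows how n keeps rare lines near 1 while frequent, lopsided lines stand out.

```python
def smoothed_ratio(bad: int, good: int, n: float = 10.0) -> float:
    """Return (n + bad) / (n + good).

    n acts as a pseudo-count added to both sides, so a line with few
    observations stays close to 1 (the prior belief that bad/good = 1).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return (n + bad) / (n + good)


# Never seen in either group: exactly the prior.
print(smoothed_ratio(0, 0))      # 1.0
# Seen once in the bad group only: barely moves away from 1.
print(smoothed_ratio(1, 0))      # 1.1
# Seen 500 times in the bad group and 5 times in the good group: stands out.
print(smoothed_ratio(500, 5))    # 34.0
```

A larger n demands more evidence before a line's score drifts far from 1; a smaller n reacts faster but is noisier.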


dotancohen 806 days ago

But this still requires you to classify each line manually to determine bad or good, no?

jldugger 806 days ago

Not manually; it just requires you to be able to group them along a dimension of interest. For example, if I get a page from us-east-1a, I can compare all the logs from that zone against us-east-1b. Or you can compare all the logs from the hour after the incident started against the same hour a day ago (or a week ago).

I pulled this technique from canary analysis and applied it to production outage analysis. In canary, you have a guaranteed random stable population that lets you perform accurate comparisons. Elsewhere, we can try to make that assumption but it might break down. For example, regional holidays can radically alter customer behavior over time or between regions. So it's not perfect but it's often good enough to provide me insights while on call.

It also requires a log query tool capable of all this filtering, grouping, counting, and scoring.
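
For anyone without such a tool, the group/count/score steps can be roughly sketched in plain Python. Everything here is an assumption for illustration: the helper names `normalize` and `rank_patterns`, the digit-collapsing rule used to group similar lines, the value `n=10`, and the sample log lines. It is not the query actually used in production.

```python
import re
from collections import Counter


def normalize(line: str) -> str:
    """Collapse runs of digits so lines that differ only in IDs or timings group together."""
    return re.sub(r"\d+", "<num>", line.strip())


def rank_patterns(bad_lines, good_lines, n=10.0):
    """Count each normalized line per group, then rank by (n + bad) / (n + good)."""
    bad = Counter(normalize(line) for line in bad_lines)
    good = Counter(normalize(line) for line in good_lines)
    # A Counter returns 0 for patterns missing from one group.
    scores = {p: (n + bad[p]) / (n + good[p]) for p in bad.keys() | good.keys()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical sample: "bad" is the affected zone or hour, "good" is the baseline.
bad_lines = ["timeout after 3000 ms"] * 40 + ["GET /health 200"] * 100
good_lines = ["timeout after 2900 ms"] * 2 + ["GET /health 200"] * 110

ranked = rank_patterns(bad_lines, good_lines)
for pattern, score in ranked:
    print(f"{score:6.2f}  {pattern}")
```

Lines near the top are over-represented in the bad group; lines scoring well below 1 went missing there, which can be just as informative.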

dotancohen 805 days ago

OK, I'm starting to see where you're going with this - I also compare incident-affected logs with pre-incident logs, or HTTP requests that I want to debug with similar requests that are known good.

What tools are you using? For me it's often just grep and awk with temp files, maybe a touch of Python occasionally.

jldugger 804 days ago

Currently, Splunk. But that was expensive before the acquisition, so I'm sure someone will come along and suggest I replace it with OpenSearch or something.
