by jldugger 808 days ago
> maybe I'm fundamentally missing something about how I'm "supposed to work", but honestly all I have ever wanted to do, when looking at logs, is see the log from one process, from beginning to end, as a text file.

This is still a valid use case, but pretend for a minute you have thousands or millions of log lines to inspect. Even after filtering for ERROR level only, you still have too many errors that devs swear "are normal" (but do not fix). And maybe the data you need for a diagnosis isn't even logged at ERROR!

The solution? Use log queries to compare a normal and an abnormal process or cluster, group the lines by some kind of fingerprint, then apply Laplace smoothing or another Bayesian technique to score each fingerprint by how strongly it is associated with the abnormal group. This lets me rapidly identify problems at scale that would otherwise take hours of poring over logs to exclude stuff by hand.

This works any time you can divide logs into "good" and "bad." Example scenarios:

- canary analysis, comparing canary and baseline

- single faulty pod in a deploy, comparing the bad container to the n good ones

- one AZ or region in a multi-region deploy

- now versus yesterday, or versus an hour ago, etc

- Android versus iPhone
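
A minimal sketch of the "group them by some kind of fingerprint" step, assuming a fingerprint is simply the log line with variable-looking tokens masked out. The masking rules and the names `fingerprint` and `count_fingerprints` are hypothetical, chosen for illustration; a real log tool would do this with its own query functions.

```python
import re
from collections import Counter

# Hypothetical masking rules: order matters, most specific pattern first.
_MASKS = [
    (re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b", re.I), "<uuid>"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<ip>"),
    (re.compile(r"\b0x[0-9a-f]+\b", re.I), "<hex>"),
    (re.compile(r"\d+"), "<n>"),
]

def fingerprint(line: str) -> str:
    """Collapse a log line to a template by masking variable tokens."""
    for pattern, placeholder in _MASKS:
        line = pattern.sub(placeholder, line)
    return line.strip()

def count_fingerprints(lines):
    """Count how often each fingerprint occurs in one group of log lines."""
    return Counter(fingerprint(line) for line in lines)
```

Run `count_fingerprints` once on the "bad" group and once on the "good" group; the two counters are what gets compared.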


  > then apply some Laplace smoothing or other bayesian techniques to score fingerprints by strength of association with abnormal
I would love to hear more about this process.
The simplest technique, and the one I currently use, is just "(n+bad)/(n+good)", where bad and good are how many times a fingerprint shows up in each group and n is basically the strength of a prior belief that bad/good = 1. At some level I think this might replicate TF-IDF[1], but I haven't sat down to prove it or find where they diverge.

[1]: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
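
A minimal sketch of that score in Python, assuming `bad_counts` and `good_counts` are plain mappings from fingerprint to occurrence count in each group. The function names and the default `n=1.0` are assumptions made for illustration.

```python
def score(bad: int, good: int, n: float = 1.0) -> float:
    """(n + bad) / (n + good): equals 1 with no evidence, grows when a
    fingerprint is over-represented in the bad group."""
    return (n + bad) / (n + good)

def rank(bad_counts, good_counts, n: float = 1.0):
    """Return (score, fingerprint) pairs, most bad-associated first."""
    keys = set(bad_counts) | set(good_counts)
    scored = [
        (score(bad_counts.get(k, 0), good_counts.get(k, 0), n), k)
        for k in keys
    ]
    return sorted(scored, reverse=True)
```

A larger `n` pulls rare fingerprints back toward 1, so a line seen once in the bad group and never in the good group does not outrank one seen hundreds of times. This sketch uses raw counts; if the two groups differ a lot in volume, normalizing the counts first is probably worthwhile, though that is an addition here and not part of the formula quoted above.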

But this still requires you to classify each line manually to determine bad or good, no?
Not manually; it just requires you to be able to group them along a dimension of interest. For example, if I get a page from us-east-1a, I can compare all the logs from that AZ against those from us-east-1b. Or you can compare all the logs from the hour after the incident started against the same hour a day ago (or a week ago).
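
A rough sketch of that kind of grouping, assuming each log record is a dict with hypothetical `az` and `ts` (a `datetime`) keys. The function names and the record shape are illustrative assumptions only; in practice this would be a filter in whatever log query language is in use.

```python
from datetime import datetime, timedelta

def split_by_field(records, field, bad_value, good_value):
    """Label records by one field, e.g. az == 'us-east-1a' vs 'us-east-1b'."""
    bad = [r for r in records if r.get(field) == bad_value]
    good = [r for r in records if r.get(field) == good_value]
    return bad, good

def split_by_window(records, incident_start,
                    window=timedelta(hours=1), offset=timedelta(days=1)):
    """Label records by time: the window after the incident started is 'bad',
    the same window one offset earlier is 'good'; everything else is dropped."""
    bad, good = [], []
    baseline_start = incident_start - offset
    for r in records:
        ts = r["ts"]
        if incident_start <= ts < incident_start + window:
            bad.append(r)
        elif baseline_start <= ts < baseline_start + window:
            good.append(r)
    return bad, good
```

Nobody labels individual lines here: the label comes entirely from which side of the split a record falls on.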

I pulled this technique from canary analysis and applied it to production outage analysis. In canary, you have a guaranteed random stable population that lets you perform accurate comparisons. Elsewhere, we can try to make that assumption but it might break down. For example, regional holidays can radically alter customer behavior over time or between regions. So it's not perfect but it's often good enough to provide me insights while on call.

And, it requires advanced log queries to perform all these filtering, grouping, counting and scoring functions.

OK, I'm starting to see where you're going with this - I also compare incident-affected logs with pre-incident logs, or HTTP requests that I want to debug with similar requests that are known good.

What tools are you using? For me it's often just grep and awk with temp files, maybe a touch of Python occasionally.

Currently, Splunk. But that was expensive before the acquisition, so I'm sure someone will come along and suggest I replace it with OpenSearch or something.