Hacker News
by perching_aix 337 days ago
I agree with this. Logging, metrics, and tracing are such hard topics for me to wrap my head around, though.

From the perspective of the log consumer (a person), you'd want logs to give you enough information when troubleshooting. But since trouble usually happens when things go wrong in unexpected ways, the logging likely won't be set up to emit the right info for you to figure out what exactly is going wrong. What, then, are you supposed to do: log the entire application state and every change to it? That's way too expensive, and there's a decent chance you'd just drown in the noise anyway. So you're left with this half art form, half science type of deal.

One thing I'm grateful for is that over the years, almost everything has come to log in JSON lines, at least. I just wish there were a standardized, simple way to access all the possible kinds of JSON objects that might be emitted into the logs. A schema would be a good start, but I can immediately see how that would quickly be rendered a lot less useful (e.g., "this and that field can contain some other serialized JSON object, good luck!").
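As a rough illustration of why a schema alone falls short, here is a minimal sketch that tallies the distinct top-level key sets found in a JSON-lines log and flags string fields that actually contain serialized JSON. The helper name `field_shapes` and the "starts with `{` or `[` and parses" heuristic are assumptions for illustration, not any standard; it also assumes every line is a JSON object.

```python
import json
from collections import Counter


def field_shapes(lines):
    """Tally distinct top-level key sets and spot fields holding serialized JSON."""
    shapes = Counter()          # sorted tuple of keys -> number of records
    nested_json_fields = set()  # keys whose string values parse as JSON
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)  # assumed to be a JSON object (dict)
        shapes[tuple(sorted(record))] += 1
        for key, value in record.items():
            # Heuristic (an assumption): a string starting with { or [ that
            # parses as JSON is treated as an embedded serialized object.
            if isinstance(value, str) and value.startswith(("{", "[")):
                try:
                    json.loads(value)
                except ValueError:
                    continue
                nested_json_fields.add(key)
    return shapes, nested_json_fields
```

Even on a tidy log, the second return value is where a flat schema stops describing what consumers actually have to parse.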


Everything is events. The problem, as you noticed, is that you frequently run into situations where there are too many events to handle. Metrics, logging, and tracing are just three different ways of handling that problem.

Metrics handle too many events by aggregating them: you squash them into a smaller number of events that carry the aggregated information.
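A minimal sketch of the aggregation idea, assuming a hypothetical `MetricsAggregator` that keeps only a count, a running total, and a maximum per event name, so any number of raw events collapses into three numbers each:

```python
from collections import defaultdict


class MetricsAggregator:
    """Squash a stream of (name, value) events into a few numbers per name."""

    def __init__(self):
        self.counts = defaultdict(int)
        self.totals = defaultdict(float)
        self.maxima = {}

    def record(self, name, value):
        # Each raw event only updates running aggregates; it is not stored.
        self.counts[name] += 1
        self.totals[name] += value
        self.maxima[name] = max(value, self.maxima.get(name, value))

    def snapshot(self):
        return {
            name: {
                "count": self.counts[name],
                "mean": self.totals[name] / self.counts[name],
                "max": self.maxima[name],
            }
            for name in self.counts
        }
```

Recording latencies of 120, 80, and 40 under one name leaves a count of 3, a mean of 80.0, and a max of 120; the individual events are gone, which is exactly the trade-off.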

Logging handles too many events by sampling them. If you have N times as many events as you can handle, take 1 in N of them or whatever other sampling model you want.
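A sketch of 1-in-N sampling. `sample_events` is a hypothetical helper, and keeping each event independently with probability 1/N is just one of the possible sampling models (a plain every-Nth counter also works, but can alias with periodic workloads):

```python
import random


def sample_events(events, n, rng=None):
    """Keep each event independently with probability 1/n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if rng is None:
        rng = random.Random()
    return [event for event in events if rng.randrange(n) == 0]
```

Passing a seeded `random.Random` makes the selection reproducible, which helps when testing the pipeline itself.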

Tracing is logging, but for chains of correlated events. If you have a "request started" event and a "request ended" event, it is pretty useless to get one without the other. So you sample at the level of the chain of correlated events: you want 1 in N chains rather than 1 in N individual events.
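Sampling at the chain level can be sketched by making the keep/drop decision a deterministic function of a trace ID, so every event in a chain gets the same answer. The `trace_id` field name, the helper names, and the SHA-256 bucketing are assumptions for illustration:

```python
import hashlib


def keep_trace(trace_id: str, n: int) -> bool:
    """Keep roughly 1 in n traces; the answer depends only on trace_id."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % n
    return bucket == 0


def sample_traces(events, n):
    """Keep or drop whole chains; events are dicts with an assumed 'trace_id' key."""
    return [event for event in events if keep_trace(event["trace_id"], n)]
```

Because the decision depends only on the ID, a "request started" event is never kept without its matching "request ended" event, and separate services can make the same decision independently without coordinating.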

But if you have enough throughput for all your events, just collect a big pile of events and throw it into a visualizer. Or better yet, just enable tracing for time travel debugging so you do not even need to figure out how the events map to your program state.

> What then, are you supposed to log the entire application state and every change to it?

For replayability/state reconstruction, it's usually enough to log the input data and the decisions made based on it, i.e., which branches of each if/switch (and things morally equivalent to them, e.g., virtual function dispatch and short-circuiting Boolean operators) you actually took.
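A minimal sketch of logging inputs plus branch decisions. `DecisionLog`, `discount_percent`, and `recorded_inputs` are hypothetical names, and the business rule is a toy; the point is that the recorded inputs are enough to re-run the function and reproduce the same decision sequence:

```python
class DecisionLog:
    """Append-only record of inputs and the branch taken at each decision site."""

    def __init__(self):
        self.entries = []

    def record_input(self, name, value):
        self.entries.append(("input", name, value))
        return value

    def branch(self, site, taken):
        # `site` labels the if/switch being evaluated (hypothetical naming).
        taken = bool(taken)
        self.entries.append(("branch", site, taken))
        return taken


def discount_percent(order_total, is_member, log):
    """Toy business rule (hypothetical) instrumented with the decision log."""
    order_total = log.record_input("order_total", order_total)
    is_member = log.record_input("is_member", is_member)
    if log.branch("member_check", is_member):
        if log.branch("large_order_check", order_total > 100):
            return 20
        return 10
    return 0


def recorded_inputs(log):
    """Recover the inputs needed to replay a run from its log."""
    return {name: value for kind, name, value in log.entries if kind == "input"}
```

Replaying with `discount_percent(**recorded_inputs(log), log=DecisionLog())` should yield identical entries; if the last recorded branch is the surprising one, that decision site is where to start looking.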

> But then that's way too expensive,

Yes, it's usually still way too expensive. But when it's not, it tells you exactly at which point in the code the "wrong" decision was made. From there you can at least start thinking about how the system could get into a state where it would make the "wrong" decision at that precise point in the code, and that usually cuts down the number of possible causes tremendously.

> I just wish there was a standardized, simple way to access all the possible kinds of JSON objects that might be emitted into the logs. A schema would be a good start ...

While not an industry standard, one commonly used open source specification for JSON log entries is ECS (Elastic Common Schema)[0]. There are others, but this one can serve a system well, IMHO.

0 - https://www.elastic.co/docs/reference/ecs/ecs-guidelines
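A small sketch of emitting one ECS-style JSON line with the standard library. It uses only a handful of ECS field names (`@timestamp`, `log.level`, `message`, `ecs.version`, `service.name`, `error.type`) written as dotted keys. The helper name and the `ecs.version` value are assumptions, so check the specification for the version and layout your log consumers expect:

```python
import json
from datetime import datetime, timezone


def ecs_log_line(level, message, service_name, extra=None):
    """Build one JSON log line using a few ECS field names (dotted keys)."""
    timestamp = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
    entry = {
        "@timestamp": timestamp.replace("+00:00", "Z"),
        "log.level": level,
        "message": message,
        "ecs.version": "1.6.0",  # assumed version string; match your target spec
        "service.name": service_name,
    }
    if extra:
        entry.update(extra)  # e.g. {"error.type": "TimeoutError"}
    return json.dumps(entry)
```

For example, `ecs_log_line("error", "payment failed", "checkout", {"error.type": "TimeoutError"})` produces a single line that any JSON-lines consumer can parse, with field names that mean the same thing across services.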

My personal answer to this is to log very little during normal operation and then log a lot when errors occur. Depending on the maturity of the system, “a lot” might mean the entire state, so I can debug afterwards.
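One way to get "log little normally, a lot on errors" with Python's standard `logging` module is a handler that keeps recent records in a bounded in-memory buffer and writes them out only when an error arrives. `FlushOnErrorHandler` is a hypothetical name. The standard library's `logging.handlers.MemoryHandler` is similar, but it also flushes whenever its buffer fills, which defeats the "log little" part:

```python
import collections
import logging


class FlushOnErrorHandler(logging.Handler):
    """Buffer recent records in memory; emit them only when an error arrives.

    During normal operation nothing reaches `target`. When a record at
    `flush_level` or above is handled, the buffered lead-up is written first.
    """

    def __init__(self, target, capacity=1000, flush_level=logging.ERROR):
        super().__init__(level=logging.DEBUG)
        self.target = target
        self.flush_level = flush_level
        # Bounded deque: the oldest records silently fall off when full.
        self.buffer = collections.deque(maxlen=capacity)

    def emit(self, record):
        self.buffer.append(record)
        if record.levelno >= self.flush_level:
            for buffered in self.buffer:
                self.target.handle(buffered)
            self.buffer.clear()
```

Dumping the entire state, for a mature system, would go into the error record itself; the buffer only supplies the recent context leading up to it.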