by crabbone 1202 days ago
> It is simple and trustworthy.

This sounds so naive for someone with 20+ years in the field...

Linux, for decades, couldn't get logging to the point where it at least doesn't lose messages (the problem with tail / logrotate is quite obvious once you think about it, but it took many years to give up the approach).

I recently hit a bug where NVidia's driver abuses Linux kernel logging in some tight loop by spamming log messages at insane speed (happens when you have two video adapters, Intel and NVidia and an external monitor). An interesting side-effect here is that Linux logging tries to throttle loggers who output too much, so, from the log you cannot tell what's happening (because even though the system is burning calories trying to print a tonne of messages, nothing really gets printed).
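To make the throttling side effect concrete, here is a minimal sketch of burst-per-interval rate limiting. It is only loosely modelled on the idea and is not the kernel's actual code; the class name `RateLimitedLogger`, its parameters, and the in-memory `lines` list standing in for the log sink are all assumptions made for illustration.

```python
import time


class RateLimitedLogger:
    """Allow at most `burst` messages per `interval` seconds; count the rest."""

    def __init__(self, burst=10, interval=5.0, clock=time.monotonic):
        self.burst = burst
        self.interval = interval
        self.clock = clock
        self.window_start = clock()
        self.emitted_in_window = 0
        self.suppressed = 0
        self.lines = []  # hypothetical sink; stands in for the real log

    def log(self, message):
        now = self.clock()
        if now - self.window_start >= self.interval:
            # New window: report how much was dropped, then reset counters.
            if self.suppressed:
                self.lines.append(f"{self.suppressed} messages suppressed")
            self.window_start = now
            self.emitted_in_window = 0
            self.suppressed = 0
        if self.emitted_in_window < self.burst:
            self.lines.append(message)
            self.emitted_in_window += 1
            return True
        # Over budget: the caller still paid for the call, but nothing is kept.
        self.suppressed += 1
        return False
```

With a scheme like this, a tight loop that logs a thousand times inside one window leaves only the first `burst` messages plus a count, which is why the log says so little about what the spammer was doing.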

Several iterations ago I worked on a product where logging had to be implemented as writes to a self-styled circular buffer in shared memory, and because too much info was printed too quickly you only had a few seconds' worth of logs before a system crash... on a good day.
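For readers who haven't seen one, a circular log buffer can be sketched roughly as below. This is not the product's implementation: `RingLog`, the fixed slot layout, and the `bytearray` standing in for the shared memory segment are assumptions chosen to keep the example short.

```python
class RingLog:
    """Fixed-capacity log: once full, each new record overwrites the oldest."""

    def __init__(self, slots, slot_size):
        if not 2 <= slot_size <= 256:
            raise ValueError("slot_size must fit a one-byte length prefix")
        self.slots = slots
        self.slot_size = slot_size
        self.buf = bytearray(slots * slot_size)  # stand-in for shared memory
        self.written = 0  # total records ever written

    def write(self, message):
        # Each slot holds one length byte followed by the (truncated) payload.
        data = message.encode("utf-8")[: self.slot_size - 1]
        start = (self.written % self.slots) * self.slot_size
        self.buf[start] = len(data)
        self.buf[start + 1 : start + 1 + len(data)] = data
        self.written += 1

    def read_all(self):
        """Return the surviving records, oldest first."""
        count = min(self.written, self.slots)
        first = self.written - count
        records = []
        for n in range(first, first + count):
            start = (n % self.slots) * self.slot_size
            length = self.buf[start]
            payload = self.buf[start + 1 : start + 1 + length]
            records.append(payload.decode("utf-8", "replace"))
        return records
```

The history you get is `slots` records deep no matter how long the system ran, so the faster the writers are, the shorter the window of time that survives.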

Needless to mention the fun of stitching together logs coming from different places in your system with separate clocks.
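A rough sketch of the stitching step, assuming the easy case: every source's log is already sorted by its own clock and you somehow know a fixed offset per source. The function name `stitch` and the `(timestamp, message)` record shape are made up for this example, and in practice estimating those offsets (and coping with drift) is the hard part.

```python
import heapq


def stitch(sources, offsets):
    """Merge per-source logs into one timeline on a common reference clock.

    sources: {name: [(local_ts, message), ...]}, each list sorted by local_ts
    offsets: {name: seconds to add to local_ts}; missing names get 0.0
    """

    def corrected(name):
        shift = offsets.get(name, 0.0)
        for ts, message in sources[name]:
            yield (ts + shift, name, message)

    # heapq.merge lazily merges already-sorted inputs.
    return list(heapq.merge(*(corrected(name) for name in sources)))
```

For example, with `offsets = {"b": 5.0}` a record stamped 5.5 on node `b` lands at 10.5 on the merged timeline. If the offset is wrong by more than the gap between two events, the merged order is wrong too.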

Even simply processing hundreds of gigabytes of logs isn't a trivial task on its own.

----

Many things are simple, when your task is simple. Logging is just one of those things.


> Many things are simple, when your task is simple. Logging is just one of those things.

I agree with much of what you said, and of course "logging" is not just a single point in the solution space — there is some function "troubleshooting_pain = f(your_project, your_approach)". I was trying to say that for "your_approach=logging" that function tends to return smaller values than for "your_approach=debugging", all other things being equal, in my experience.

Whereas your comments seem more oriented towards the "your_project" factor. Of course using logs is harder on a distributed system. But so is using a debugger, or just about anything else.

Perhaps I should have said "It is relatively simple and trustworthy, even if it can still get hairy at the extremes."

Both interactive and declarative debuggers work better in distributed systems than logging because they can observe events as they happen, and don't need to recreate the order in which they happened from records, which are very hard to make chronologically consistent.

Things like eBPF (which may implement a sort of declarative debugger) are perhaps the only tools you may hope to use in high-volume, high-frequency systems.

If I could only choose one technology used for software diagnostics, I'd choose debuggers over logging. Debuggers need more effort to develop them, and they aren't very good (yet), but they have potential. I don't believe that logging can be substantially improved to deal with difficult problems.

One thing that can be pretty nice, and which is kinda neither traditional debugging nor logging, is DTrace (or similar). Basically event tracing on steroids. Maybe eBPF is in that vein? I don't have much personal experience with it, but I have heard some stories of good success on busy production systems.

I guess my (limited) experiences with distributed systems are different than yours. The notion of "pausing" the system to step through things interactively was usually untenable. Do you stop the one node that shows the issue, and let the others run, getting into who-knows-what shared state? Do you somehow attempt to stop them all, and hope/pray that they all are in the right state to make your cross-node analysis meaningful? This was mostly on Apache Spark, where parallelism was the name of the game. Maybe for some kind of long-running distributed system like Erlang it's a different story.

> Maybe eBPF is in that vein?

Not just that :) It was "inspired by". Well, it's the same idea.

> The notion of "pausing" the system to step through things interactively was usually untenable.

That's not what eBPF would be used for in such a system. You'd write a bit of code that can be loaded into a running program and executed when a particular condition occurs. Like how you can attach some code to evaluate on a breakpoint in many other debuggers.
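As an analogy only (this is plain Python, not eBPF, and works very differently under the hood), the "attach code that runs on a condition without pausing anything" idea looks roughly like this. The helper `attach_probe` and its arguments are hypothetical names invented for the sketch.

```python
import functools


def attach_probe(owner, name, condition, action):
    """Wrap owner.name so `action` runs whenever `condition(args, kwargs)` holds.

    Returns a function that detaches the probe and restores the original.
    """
    original = getattr(owner, name)

    @functools.wraps(original)
    def probed(*args, **kwargs):
        if condition(args, kwargs):
            action(args, kwargs)  # observe only; the call is never paused
        return original(*args, **kwargs)

    setattr(owner, name, probed)
    return lambda: setattr(owner, name, original)
```

You would attach it to a live object, let the program keep running, collect whatever the `action` records for the calls that match, and then call the returned function to detach. Nothing stops, which is the difference from stepping through with an interactive debugger.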