by turtledragonfly 1202 days ago
> Many things are simple, when your task is simple. Logging is just one of those things.

I agree with much of what you said, and of course "logging" is not just a single point in the solution space — there is some function "troubleshooting_pain = f(your_project, your_approach)". I was trying to say that for "your_approach=logging" that function tends to return smaller values than for "your_approach=debugging", all other things being equal, in my experience.

Whereas your comments seem more oriented towards the "your_project" factor. Of course using logs is harder on a distributed system. But so is using a debugger, or just about anything else.

Perhaps I should have said "It is relatively simple and trustworthy, even if it can still get hairy at the extremes."


Both interactive and declarative debuggers work better in distributed systems than logging because they can observe events as they happen. They don't need to reconstruct the order of events from log records, which are very hard to make chronologically consistent across nodes.
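To make the ordering problem concrete, here is a minimal sketch of what happens when logs from two nodes are merged by timestamp. Everything in it is made up for illustration: the node names, the 50 ms clock skew, and the messages. Real systems have smaller but harder-to-measure skew.

```python
from dataclasses import dataclass

# Assumed for illustration: node B's clock runs 50 ms behind node A's.
SKEW_B_MS = -50


@dataclass
class Record:
    node: str
    local_ts_ms: int  # timestamp taken from the node's own (skewed) clock
    true_order: int   # real causal order; a log collector never sees this
    message: str


def emit(node, true_time_ms, true_order, message):
    """Create a log record stamped with the emitting node's local clock."""
    skew = SKEW_B_MS if node == "B" else 0
    return Record(node, true_time_ms + skew, true_order, message)


# A request/reply exchange, listed in the order it really happened.
records = [
    emit("A", 1000, 0, "send request 42"),
    emit("B", 1010, 1, "receive request 42"),
    emit("B", 1020, 2, "send reply 42"),
    emit("A", 1030, 3, "receive reply 42"),
]

# What a log aggregator typically does: sort everything by timestamp.
merged = sorted(records, key=lambda r: r.local_ts_ms)
merged_messages = [r.message for r in merged]

for r in merged:
    print(r.local_ts_ms, r.node, r.message)
```

In the merged view, node B appears to receive the request before node A sends it. Nothing in the records themselves tells you the order is wrong.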

Things like eBPF (which can serve as a sort of declarative debugger) are perhaps the only tools you can hope to use in high-volume, high-frequency systems.

If I could choose only one technology for software diagnostics, I'd choose debuggers over logging. Debuggers take more effort to develop, and they aren't very good (yet), but they have potential. I don't believe that logging can be substantially improved to deal with difficult problems.

One thing that can be pretty nice, and which is kinda neither traditional debugging nor logging, is DTrace (or similar). Basically event tracing on steroids. Maybe eBPF is in that vein? I don't have much personal experience with it, but I have heard some stories of good success on busy production systems.

I guess my (limited) experiences with distributed systems are different from yours. The notion of "pausing" the system to step through things interactively was usually untenable. Do you stop the one node that shows the issue and let the others run, getting into who-knows-what shared state? Do you somehow attempt to stop them all, and hope/pray that they are all in the right state to make your cross-node analysis meaningful? This was mostly on Apache Spark, where parallelism was the name of the game. Maybe for some kind of long-running distributed system like Erlang it's a different story.

> Maybe eBPF is in that vein?

Not just in that vein :) It was "inspired by" DTrace. Well, it's the same idea.

> The notion of "pausing" the system to step through things interactively was usually untenable.

That's not what eBPF would be used for in such a system. You'd write a bit of code that can be loaded into a running program and executed when a particular condition occurs, without stopping anything. It's like how you can attach some code to evaluate on a breakpoint in many other debuggers.
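A rough analogy in plain Python, to show the shape of the idea. This is not eBPF: real eBPF programs are loaded into the kernel, checked by a verifier, and attached to hooks such as kprobes, uprobes, or tracepoints. All the names here (`attach`, `traceable`, `handle_request`) are hypothetical.

```python
import functools

# Hypothetical probe registry: function name -> list of (condition, action).
_probes = {}


def attach(func_name, condition, action):
    """Attach a probe to an already-running function by name."""
    _probes.setdefault(func_name, []).append((condition, action))


def traceable(func):
    """Mark a function as a probe point. Probes run inline; nothing pauses."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        for condition, action in _probes.get(func.__name__, ()):
            if condition(*args, **kwargs):
                action(*args, **kwargs)
        return func(*args, **kwargs)
    return wrapper


@traceable
def handle_request(request_id, latency_ms):
    return f"handled {request_id}"


slow_requests = []
results = [handle_request(1, 20)]  # runs before any probe is attached

# Attach a probe while the "system" keeps running: record slow requests only.
attach(
    "handle_request",
    condition=lambda request_id, latency_ms: latency_ms > 100,
    action=lambda request_id, latency_ms: slow_requests.append(request_id),
)

results.append(handle_request(2, 250))  # condition holds, probe fires
results.append(handle_request(3, 30))   # condition fails, probe stays silent
```

The point is that the handler keeps serving requests the whole time. The probe only collects data when its condition matches, so there is no "which nodes do I pause" question.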