Hacker News
by drewcoo 874 days ago
For issues in the field you really want metrics and logs. That way it's easy to monitor the state of things and to zoom in on the specific data you need when you're investigating, whether that's OMG right now or days or weeks from now, and whether you're looking at a single entity, a local group, or a distributed set of them. Even if you're investigating a single system, you may want to correlate with events in other systems leading up to, simultaneous with, or soon after your incident. This is what people mean when they talk about o11y (observability).
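
To make the correlation point concrete, here is a rough sketch in Python of pulling events from several systems that fall close to an incident in time. Everything in it is made up for illustration: the `make_event` and `events_near` helpers, the system names, the messages, and the timestamps. A real setup would query a log or metrics store rather than a list in memory.

```python
from datetime import datetime, timedelta, timezone


def make_event(system, message, ts):
    """Build a structured log event (hypothetical shape, not any real schema)."""
    return {"ts": ts.isoformat(), "system": system, "message": message}


def events_near(events, incident_ts, window):
    """Return events from any system within +/- window of the incident, oldest first."""
    lo, hi = incident_ts - window, incident_ts + window

    def parsed(event):
        return datetime.fromisoformat(event["ts"])

    hits = [e for e in events if lo <= parsed(e) <= hi]
    return sorted(hits, key=parsed)


# Illustrative data only: an incident at 4 am and events from four systems.
incident = datetime(2024, 1, 1, 4, 0, tzinfo=timezone.utc)
events = [
    make_event("cache", "eviction storm", incident + timedelta(minutes=2)),
    make_event("batch", "nightly job finished", incident - timedelta(hours=6)),
    make_event("api", "5xx rate spike", incident),
    make_event("db", "replica lag rising", incident - timedelta(minutes=3)),
]

# Events leading up to, simultaneous with, or soon after the incident.
nearby = events_near(events, incident, timedelta(minutes=5))
for event in nearby:
    print(event["ts"], event["system"], event["message"])
```

The same filter works whether you run it right away or weeks later, as long as the events were written with timestamps and a system name in the first place.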

Ideally, events will be recoverable but still debuggable. Depending on the kind of thing you're looking at, you may not have the (somewhat dubious) luxury of a core dump.

I'm still on the fence about whether a core dump or a Java exception unwind is more useful for new staff woken up by a "pager" at 4 am. /s