Hacker News new | ask | show | jobs
by williamdclt 1057 days ago
Do you mean you do _not_ capture stacktraces of errors in a live system ? How do you go about understanding where the error comes from ?
1 comments

I log stacks for every error-level log and have never found it that useful. It's better than just logging "EOF" with no context of course, but manually annotating each frame with information not known to the caller is the way to go. Shifting to Go specifics; stack traces miss things like channel recvs and loops. Consider:

   for _, datum := range data {
       if err := DoSomethingWithDatum(datum); err != nil {
           log.Error(...)
       }
   }
In that case, the stack trace misses the most important thing: which datum failed.

Another common case:

   type Thing struct {
       Value any
       Err error
   }
   func Produce() {
       ch <- MakeThing()
   }
   func Consume() {
       for _, thing := range ch {
           if thing.Err != nil {
              log.Error(...)
           }
       }
   }
This one is easier to get right; capture the stack when MakeThing's implementation produces a Thing with err != nil. But, a lot of people just log the stack at log.Error which is basically useless. (Adding to the fun, sometimes Consume() is going to be an RPC to another service written in a different language. But you're still going to want a stack to help debug it.)

TL;DR stack traces are better than nothing, but a comprehensive way of handling errors and writing the information you need to fix it to the log is going to be more valuable. It is a lot of work, but I've always found it worthwhile.

> In that case, the stack trace misses the most important thing: which datum failed.

OK we agree that the stacktrace isn't _enough_, but it's still a really useful thing to have to understand what exactly happened (and quite often the single most useful thing). Of course we still expect devs to capture the information that led to the `log.Error`, so that we don't have to play guess games.

Rather than manually-annotated logs, I'd prefer getting rid of all logging altogether and use tracing (opentelemetry), which is precisely designed for observability.

> I'd prefer getting rid of all logging altogether

I prefer to build metrics and distributed traces from logs! But, I think we can agree that they're the same thing. It's a stream of events that a system captures and lets you retrieve. (I wrote our logging/tracing/metrics library at work, and indeed I call any sort of start/finish operation a Span. https://github.com/pachyderm/pachyderm/blob/master/src/inter...)

Post-processing into an efficient retrieval system is the key. You can tail the logs and send span starts/ends to Jaeger, and you can tail the logs and send summarized metrics to Prometheus or InfluxDB. I really like having the too-high-cardinality raw data around, though, so I can analyze specific failing requests. For example, you wouldn't want to have per-request "tx_bytes" metrics in Prometheus; the cardinality is too high and it blows up the whole system. But if you log your metrics, then you find the request (grep for lines that contain the x-request-id), and you can see exact timestamps for when every megabyte of data was sent along with everything else that happened in between. ("It took 10 seconds to send 1MB of this file?" "We did 300 point SQL queries between the 10th and 11th megabyte?" Things like that.) Meanwhile, you still have summarized data for alerts ("across all requests handled by this machine, 90% of requests are taking longer than 10 seconds").

Logs end up being a lot of data, but storing a terabyte of logs to save me a week of debugging is the biggest no-brainer in software engineering. $20 for 1 week of time saved. I'll also add that as a backend developer, logs are your software's UI. If something goes wrong, that's where the operator interfaces with your software to fix the problem. So they are not to be neglected or done without care; the same way you'd probably spell check your HTML frontend.

I'll also add, I have a weird background. When I worked on Google Fiber, we wanted the ability to add new fleet-wide CPE metrics to the system without a software push. So, we logged aggressively and turned relevant log lines into metrics on the log receiver side. (I designed and wrote that system.) That meant we could monitor for issues without taking people's Internet down for 15 minutes while their routers rebooted to apply a software update that captured a metric we wanted to monitor or alert on. At my current job, we don't operate the software we write; our users buy a license and then they do God Knows What on their own infrastructure. What that means is if I want something like Jaeger, it's on me to install it, maintain it, upgrade it, and support it in an environment that I can only access by saying commands to type on a Zoom call. The juice was worth the squeeze exactly once; for logs. Users can run "pachctl debug dump", generate a tar file, send it to us, and then we have logs, metrics, and traces without any operational effort on their end.

While I'm here, here's how I do metrics-to-logs on the producing side: https://github.com/pachyderm/pachyderm/blob/master/src/inter... Apache2 if anyone wants to steal it ;)

> I prefer to build metrics and distributed traces from logs! But, I think we can agree that they're the same thing. It's a stream of events that a system captures and lets you retrieve.

...

> Post-processing into an efficient retrieval system is the key. You can tail the logs and send span starts/ends to Jaeger, and you can tail the logs and send summarized metrics to Prometheus or InfluxDB.

I'm not sure it's the same thing; the idea that you spit everything out into this flat stream of bytes and then try to parse it back into structured data seems ass-backwards to me. I agree with keeping a trace of events that happen in your system, but if it's data you care about, don't you want to keep it structured the whole way through? At which point it's not really "logs" as usually understood.