Hacker News new | ask | show | jobs
by TylerE 1057 days ago
This make anyone elses eyebrows raise sky high at this?

> Claire replied, asking for the full stacktraces for the log entries, which I was able to also extract from the logs.

This is either deep voodoo magic, or the code or configuration is turning a Xeon into the equivalent of a 286. House is that not, like, megabytes on every single hit?

4 comments

> HTTP 500 errors when viewing an account

> Stacktrace for that 500

This is the default ruby on rails behavior. It prints a stacktrace on any 500 or unknown error, and it's just line numbers and filepaths.

> megabytes on every single hit

I run a rails app that's very poorly designed.

I just checked, and the stack trace for a single 500 is 5KiB. It doesn't even add up to 1MiB a day since there's only a 500 error about every hour.

> This is either deep voodoo magic, or the code or configuration is turning a Xeon into the equivalent of a 286

Having a call stack handy is is actually pretty performant. Java's default exception behavior is to bubble up a stack trace with every exception, whether you print it or not, and java applications run just fine. You have the call stack anyway since you have to know how to return, so the only extra information you need handy is the filename and line number debug symbols, and ruby needs that info anyway just by the nature of the language.

>Java's default exception behavior is to bubble up a stack trace with every exception, whether you print it or not,

Anyone who has spent 5 minutes in Java knows exactly what this looks like. And also how unwelcoming it is to new programmers.

I have spent more than 5 minutes in java, and I don't know why having a stack available on an exception is unwelcoming to new programmers.

It definitely seems better than the approaches some other languages have, like C's "return code 1, check errno", when most new programmers don't even know what an errno is.

Can you explain it to me?

Sure, but imagine how unwelcoming to everyone it would be if you didn't get a stack trace.
Recording stacktraces of errors is a pretty reasonable thing to do. And ideally not every hit causes an error.
Do you mean you do _not_ capture stacktraces of errors in a live system ? How do you go about understanding where the error comes from ?
I log stacks for every error-level log and have never found it that useful. It's better than just logging "EOF" with no context of course, but manually annotating each frame with information not known to the caller is the way to go. Shifting to Go specifics; stack traces miss things like channel recvs and loops. Consider:

   for _, datum := range data {
       if err := DoSomethingWithDatum(datum); err != nil {
           log.Error(...)
       }
   }
In that case, the stack trace misses the most important thing: which datum failed.

Another common case:

   type Thing struct {
       Value any
       Err error
   }
   func Produce() {
       ch <- MakeThing()
   }
   func Consume() {
       for _, thing := range ch {
           if thing.Err != nil {
              log.Error(...)
           }
       }
   }
This one is easier to get right; capture the stack when MakeThing's implementation produces a Thing with err != nil. But, a lot of people just log the stack at log.Error which is basically useless. (Adding to the fun, sometimes Consume() is going to be an RPC to another service written in a different language. But you're still going to want a stack to help debug it.)

TL;DR stack traces are better than nothing, but a comprehensive way of handling errors and writing the information you need to fix it to the log is going to be more valuable. It is a lot of work, but I've always found it worthwhile.

> In that case, the stack trace misses the most important thing: which datum failed.

OK we agree that the stacktrace isn't _enough_, but it's still a really useful thing to have to understand what exactly happened (and quite often the single most useful thing). Of course we still expect devs to capture the information that led to the `log.Error`, so that we don't have to play guess games.

Rather than manually-annotated logs, I'd prefer getting rid of all logging altogether and use tracing (opentelemetry), which is precisely designed for observability.

> I'd prefer getting rid of all logging altogether

I prefer to build metrics and distributed traces from logs! But, I think we can agree that they're the same thing. It's a stream of events that a system captures and lets you retrieve. (I wrote our logging/tracing/metrics library at work, and indeed I call any sort of start/finish operation a Span. https://github.com/pachyderm/pachyderm/blob/master/src/inter...)

Post-processing into an efficient retrieval system is the key. You can tail the logs and send span starts/ends to Jaeger, and you can tail the logs and send summarized metrics to Prometheus or InfluxDB. I really like having the too-high-cardinality raw data around, though, so I can analyze specific failing requests. For example, you wouldn't want to have per-request "tx_bytes" metrics in Prometheus; the cardinality is too high and it blows up the whole system. But if you log your metrics, then you find the request (grep for lines that contain the x-request-id), and you can see exact timestamps for when every megabyte of data was sent along with everything else that happened in between. ("It took 10 seconds to send 1MB of this file?" "We did 300 point SQL queries between the 10th and 11th megabyte?" Things like that.) Meanwhile, you still have summarized data for alerts ("across all requests handled by this machine, 90% of requests are taking longer than 10 seconds").

Logs end up being a lot of data, but storing a terabyte of logs to save me a week of debugging is the biggest no-brainer in software engineering. $20 for 1 week of time saved. I'll also add that as a backend developer, logs are your software's UI. If something goes wrong, that's where the operator interfaces with your software to fix the problem. So they are not to be neglected or done without care; the same way you'd probably spell check your HTML frontend.

I'll also add, I have a weird background. When I worked on Google Fiber, we wanted the ability to add new fleet-wide CPE metrics to the system without a software push. So, we logged aggressively and turned relevant log lines into metrics on the log receiver side. (I designed and wrote that system.) That meant we could monitor for issues without taking people's Internet down for 15 minutes while their routers rebooted to apply a software update that captured a metric we wanted to monitor or alert on. At my current job, we don't operate the software we write; our users buy a license and then they do God Knows What on their own infrastructure. What that means is if I want something like Jaeger, it's on me to install it, maintain it, upgrade it, and support it in an environment that I can only access by saying commands to type on a Zoom call. The juice was worth the squeeze exactly once; for logs. Users can run "pachctl debug dump", generate a tar file, send it to us, and then we have logs, metrics, and traces without any operational effort on their end.

While I'm here, here's how I do metrics-to-logs on the producing side: https://github.com/pachyderm/pachyderm/blob/master/src/inter... Apache2 if anyone wants to steal it ;)

> I prefer to build metrics and distributed traces from logs! But, I think we can agree that they're the same thing. It's a stream of events that a system captures and lets you retrieve.

...

> Post-processing into an efficient retrieval system is the key. You can tail the logs and send span starts/ends to Jaeger, and you can tail the logs and send summarized metrics to Prometheus or InfluxDB.

I'm not sure it's the same thing; the idea that you spit everything out into this flat stream of bytes and then try to parse it back into structured data seems ass-backwards to me. I agree with keeping a trace of events that happen in your system, but if it's data you care about, don't you want to keep it structured the whole way through? At which point it's not really "logs" as usually understood.

You’ve confused a stack trace with a core dump (or something similar).