Hacker News new | ask | show | jobs
by cytzol 1303 days ago
I found this "best practice" curious to read:

> The standard net/http server violates this advice and recovers panics from request handlers. Consensus among experienced Go engineers is that this was a historical mistake. If you sample server logs from application servers in other languages, it is common to find large stacktraces that are left unhandled. Avoid this pitfall in your servers.

I don't think I've ever seen a server library — HTTP or otherwise — that didn't have a top-level "catch all exceptions" or "recover from panic" step in place, so that if there's a problem, it can return 500 (or the Internal Server Error equivalent) to the user and then carry on serving other requests.

My reasoning is that any panic-worthy programming error is almost certainly going to be in the "business logic" part of the server, rather than the protocol-parsing "deal with the network" part, and thus, recoving from a panic caused by processing a request is "safe". One incoming request could cause a panic, but the next request may touch completely unrelated parts of the program, and still be processed as normal. Furthermore, returning a 500 error but having nobody read the stacktrace is bad, yes, but it's way, way, way better than having your server crash meaning nobody can use it ever.

Oh wait, is the assumption here that your service is being run under Borg and has another 1000 instances running ready to jump in and take the crashed one's place? Is this another case of Google forgetting that people use Go outside of Google, or am I reading too much into this?

9 comments

The question: what is the state of your server after a handler panics? The answer: you have no idea. It is not wise to continue serving requests when you may have serious issues in the state of your server. Maybe some central data structure is now corrupt. You have no way of knowing. Fail fast and fail hard.

OTOH, maybe your priorities are different and you would prefer to be more available than correct. In that case by all means add a recover to the top level of your request handlers. But it was a mistake to have made this decision for all users of net/http ahead of time.

In my experience, the panic is most likely because someone accessed a nil field when adapting some data. Nothing is corrupt, we just threw an exception in a mundane way.

The reality is this is far more common than something truly fatal. Mistake for someone or not, it probably is correct for most use cases

Yes, why they made that decision back then. However in the end the lib cannot and shouldn't know, and this decision needs to be intentionally done by the middleware. Correct for most use cases is not good enough here, should be always correct for such fundamental thing.
You're assuming that the code was written exception safely. E.g.

    mu.Lock()
    foo := bar[baz]    // <- throws exception / panics
    mu.Unlock()
Go is sold as a language without exceptions, so people don't write exception-safe code. Which is fine, except when exceptions are actually caught.
That wouldn't pass a code review where I work... Use a defer to do the unlock
I would also not allow it. I'm saying the problem is that core Go developers say "Go doesn't have exceptions", which is manifestly false, but causes people to not write exception safe code.

But despite you and me, I'm saying there's a lot of broken code out there because of this doesn't-but-actually-does misinformation.

And it's very annoying that you have to tell people to do:

    var i int
    func() {
      mu.Lock()
      defer mu.Unlock()
      i = foo[bar]
    }()
Clean code, that is not. (even if you simplify it by having the lambda return the int)
This is like the biggest thing scaring me away from Go. This half-assed "we don't use exceptions, so you shouldn't have to care about it, except when we do, so you still must write proper defers, which are now doubly verbose because nobody considered it a primary use case"... In any other language, a mutex outside a using/with/try-with/RAII would be instantly flagged in code-review or linter tools. In many cases even hard to write incorrectly, due to entering context being only way to acquire the lock.

Now this middle ground leaves you having to write triple verbose if err != null on every third line of your code and still not be safe from panics-that-shouldnt-have-been-panics.

As parent says, the only way panics can ever work is if the top-level never catches and recovers from them. I'm no expert in go but that would mean in such perfect world, defer should hardly ever be needed at all, not even for locks? Only for truly external resources? But now with popular web servers doing such recovery, the entire ecosystem got polluted and all need to handle it?

Does defer get called on panic? I thought panic cancels all further execution
Yes it does, which is why recovering from a panic can be done in a deferred function. The go runtime maintains enough metadata to track what deferred functions need to be run while unwinding the stack.
The main issues that usually arise as a result of catching panics in handlers like HTTP is that unless code it written very deliberately to be able to recover from being interrupted in the middle of any function call, there is a high risk of e.g. mutexes left locked, which in turn leads to a (silent) program deadlock in general.

This has happened during my couple years at Google at least once, even though it wasn't in an HTTP handler, but the issue was very similar.

> most likely

> probably is correct

Yeah I don't know...

But that implies your request checking is off; if your server expects a field to be filled in and a client doesn't send it, that should be a HTTP 400 Bad Request error, not an internal server error.

C/C++ based servers running into an error like that would possibly be open for attacks. Go will be a bit more resilient, but it's still better to avoid situations like that.

That said, logging the error and going on serving responses is fine I think (pragmatic), as long as the error is analyzed. But an error that doesn't trigger immediate action is a warning, and warnings are noise [0].

[0] https://dave.cheney.net/2015/11/05/lets-talk-about-logging#:...

> In my experience, the panic is most likely because someone accessed a nil field when adapting some data. Nothing is corrupt, we just threw an exception in a mundane way.

Even more so if a database is involved (which is generally the case), because odds are the transaction just gets rolled back and there's basically nothing that could be corrupted.

The Echo web framework is an interesting example of this [0]. By default it doesn't handle panics, but there is a configurable `Recover` middleware which does so. I agree though it's unusual, and this stuck out to me as an exception that proves the rule.

[0] https://echo.labstack.com/middleware/recover/

I believe the idea is that 'panic' is considered something fatal. If it happens, the application should die unless you have a strong reason for it not to. If you can recover e.g. by returning HTTP 500 for a single request, it should be handled by returning errors up throughout the call stack, i.e., by error handling instead of panicking.

It's definitely an opinionated approach though.

Nil pointer panics happen. When a DB column that was supposed to be non-null is null and the code unexpectedly accesses it. Oh schema won't help because legacy. There are rare though and we fix such issues as they are discovered, but just pointing out a benign case when crashing the whole server on panic may be a wrong approach. Our approach is to always recover from panics so the server keeps chugging, but raise an alarm so someone is paged and fixes the issue right away.
> I believe the idea is that 'panic' is considered something fatal.

I think opinions in 3rd party Go code on what “fatal” means vary, that’s the issue. Sure it was intended to mean “this error is so bad the entire program needs to die right now” but in practice there’s cases where it’s treated more like an unchecked exception in Java, i.e., “I can’t recover from this so _I’m_ going to give up but _you_ can keep going.” Or put another way, what a library might consider fatal the caller doesn’t. It can be argued whether or not that’s the right thing to do, but the fact is it happens in the wild, so for something like a server you probably should trap it.

> but in practice there’s cases where it’s treated more like an unchecked exception in Java, i.e., “I can’t recover from this so _I’m_ going to give up but _you_ can keep going.”

Java has a supertype for unrecoverable problems like out-of-memory-errors. It's called "Error", a subtype from Throwable. Exceptions are also Throwables, but they are NOT errors.

There's nothing unrecoverable about Error exceptions. If one thread hit a bug and threw StackOverflowError, it's not a reason to kill the entire application or even thread itself, just unwind stack, show some error and continue processing next task. The only tricky situation I'm aware of is OutOfMemoryError, because it can affect other threads. I'd prefer to restart the server. But again it's just because typical code allocated heap all the time. One can write code carefully without heap allocations and this code will work just fine with OOM errors thrown around.
> It can be argued whether or not that’s the right thing to do

No, because both approaches could be right at a higher level, to pretake that decision at this level is definitely wrong here?!

No, that's about error handling: panic/recover in the standard net/http server is used in lieu of exception handling, which is absent in Go. Instead, they (and official Go docs as well) recommend to use "if-hell".
Presumably you have a load balancer in front of your Go code that's written using saner languages and assumptions that does handle errors correctly.
If your program fails in testing, and no-one reads the logs, does it ever get fixed?

There is one advantage to not having a top-level catch: if it fails in testing, it's very obvious immediately.

Though I do agree with you - and this can be done with a feature flag for dev/prod servers.

> Is this another case of Google forgetting that people use Go outside of Google, or am I reading too much into this?

Whether or not they are forgetting aside, this is Google’s style guide for code bases in Google. I don’t think non-Google Go programmers were a consideration for them.

On a broader note, it seems as though anytime Google publishes something people interpret it as “industry standard” (see their C++ style guide) and apply it to their non-Google projects. I personally don’t see this as healthy.

I've always thought that the ideal is somewhere in between the two.

1. Catch the panic/exception.

2. Track the rate of these panics or exceptions. If it is too high some data structure has probably been corrupted or some lock has been poisoned. If a lot of requests are failing abort.

And ideally: 3 signal that you are in a degraded state so that some external process can gracefully drain your traffic and restart you. Although very few people have this level of self-healing infrastructure set up.

I wrote a web server that handled a lot of requests, and my solution to 2. was to have it notify me via our alert chat channel every time it panicked. This was rare enough that I could investigate them individually, and if it got overwhelming (it never did) I could choose to do something fancier.
How I understand it is that it's up to the user to create a panic handler middleware, which is perfectly valid and what most people are doing anyway.

Something like: https://github.com/go-chi/chi/blob/master/middleware/recover...