by Ramone 4493 days ago
It's actually pretty silly to write node.js apps in a crash-only manner. Since they often handle thousands of connections concurrently, a failure in one client's processing is pretty horrendous if it brings down the whole server (even if the server process gets restarted immediately). This project doesn't try to solve that either, though.

In the tradeoff between a short post and a long elaboration, I erred on the side of being too terse.

Node applications can be split into stateful servers (e.g. chat) and stateless servers (e.g. an API for mobile clients).

It's in the stateless servers that some developers write fault-tolerant / fail-fast / fast-restart applications. Doing so in a stateful server would be counterproductive.

Also, this is not meant to imply that developers are writing sloppy code that fails constantly or that they fail to implement proper error handling. What I was saying is that unhandled exceptions are unexpected, and when they do occur they indicate that something is seriously wrong. Importantly, an unhandled exception leaves the application in an unknown state, which is difficult to recover from. In such situations, a robust approach is to let Node fail, restart fast, and come back with clean state.

The reasons this is a sound approach are:

1. If the failure is due to a memory leak, the monitoring graphs will show the leak clearly.

2. The unhandled exception indicates that something is very wrong. A server restart is easy to see in the logs and is a warning that deeper inspection is necessary.

3. Recovering state after an unhandled exception is difficult. In a stateless server it's better to just restart from a clean state. This assumes the engineers wrote the application to work from a clean state (i.e. after a restart there is no need to recreate state).

4. A fault-tolerant architecture is good practice, as disks can fail, CPUs can fail, network connections can fail, etc. In a cluster, failure is expected and applications are architected to continue operating in the face of failure.
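
A minimal sketch of the fail-fast approach described above, assuming the process runs under some supervisor that restarts it on a non-zero exit. The names `makeCrashHandler`, `log`, and `exit` are hypothetical and only there so the behaviour can be exercised without actually killing the process.

```javascript
// Hypothetical helper: build an uncaughtException handler that logs and exits.
function makeCrashHandler({
  log = console.error,
  exit = (code) => process.exit(code),
} = {}) {
  return function onUncaughtException(err) {
    // Record why the process is going down so the restart is visible in the logs.
    log(`uncaught exception, exiting: ${err && err.stack ? err.stack : err}`);
    // State is now unknown: exit non-zero and rely on the supervisor
    // (an assumption here) to start a fresh process with clean state.
    exit(1);
  };
}

// Typical wiring at startup.
process.on('uncaughtException', makeCrashHandler());
```

The handler deliberately does no recovery work. Anything beyond logging would run against state that can no longer be trusted.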

I actually think I understood you, but I'm saying that in the case where statelessness should be an advantage, node.js is actually a much less fault-tolerant environment than most other web application servers. Most other web app servers offer

(1) request isolation (so most failures in one request can't break other requests) and

(2) a way to catch all exceptions/errors in a single request (and domains don't accomplish this unless you know where to expect errors from, or wrap everything).

Since node.js doesn't offer those features, it's not even as fault tolerant as PHP was 15 years ago. I'm a huge fan of node.js, but one of the hardest things to do on a large application with a large number of users is to keep an instance of the server from restarting and dropping all the other in-progress requests. If you write your node.js code to be crash-only (like one might do with Erlang), your clients are going to have a terrible time.
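
To illustrate the missing request isolation, here is a small sketch of why a `try`/`catch` around a request handler does not cover work the handler schedules for later. It uses a hand-rolled queue (`pending`, `defer`, `drain`, all hypothetical names) as a stand-in for the event loop so the behaviour is deterministic. It is not how Node schedules callbacks internally.

```javascript
// Stand-in for the event loop: callbacks queued here run later,
// after the function that queued them has already returned.
const pending = [];
const defer = (fn) => pending.push(fn);

function handleRequest(id) {
  try {
    // The try block only covers *scheduling* the callback.
    defer(() => {
      if (id === 2) throw new Error(`request ${id} failed`);
      return `request ${id} ok`;
    });
  } catch (err) {
    // Never reached for errors thrown inside the deferred callback.
    return 'caught in handler';
  }
  return 'scheduled';
}

function drain() {
  const results = [];
  while (pending.length > 0) {
    const task = pending.shift();
    try {
      results.push(task());
    } catch (err) {
      // In a real Node process nothing sits here by default: the error
      // becomes an uncaughtException and takes down every in-flight request.
      results.push(`escaped: ${err.message}`);
    }
  }
  return results;
}
```

The error from request 2 surfaces at the loop level, not in the handler that caused it, which is the gap the comment above describes.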

We had the problem you're describing for a while, but have since figured out how to avoid processes going down and interrupting other requests. Essentially, you attach a global domain, and when that domain catches an error you stop accepting new connections (obviously you have to be load-balancing between processes) and start a countdown. Some reasonable amount of time later (I think we wait 30 seconds?) you assume that any in-progress request is done and restart the process. We've found this to be very successful.
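
The drain-then-restart idea above might look roughly like the following. This is a sketch under assumptions, not the commenter's actual code: `drainAndExit`, `graceMs`, `exit`, and `schedule` are hypothetical names, the 30-second default simply mirrors the figure mentioned above, and a supervisor or load balancer is assumed to route traffic elsewhere and restart the process.

```javascript
// Hypothetical helper: returns an error handler that stops taking new
// connections and exits after a grace period.
function drainAndExit(server, {
  graceMs = 30000,
  exit = (code) => process.exit(code),
  schedule = setTimeout,
} = {}) {
  let draining = false;
  return function onFatalError(err) {
    if (draining) return; // a second error must not restart the countdown
    draining = true;
    console.error('fatal error, draining:', err && err.message);
    // server.close() stops accepting new connections; existing ones continue.
    server.close();
    // Assume in-flight requests finish within the grace period, then exit
    // non-zero so the process is restarted with clean state.
    schedule(() => exit(1), graceMs);
  };
}

// Usage sketch with a global domain (port number is illustrative):
//   const d = require('domain').create();
//   d.on('error', drainAndExit(server));
//   d.run(() => server.listen(8080));
```

Injecting `exit` and `schedule` is only a convenience for trying the handler out without waiting 30 seconds or killing the process.
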
We do this too, attaching req and res objects to a domain, as well as databases and other network-related objects (like SMTP clients, etc.). This is a huge improvement, but I'm still seeing occasional uncaught error events in our logs on a very large codebase and only in production. Some of them are just ECONNRESET events with no details given, so their origins are REALLY hard to track down. Have you got some magic for catching everything without explicitly having to find all objects that could be emitting? I'd love to hear it if so...
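
Attaching req and res to a per-request domain could be sketched as below. `withRequestDomain` and `onError` are hypothetical names. The calls `domain.create()`, `add()`, `run()`, and the domain `'error'` event are part of Node's `domain` module, which the Node documentation marks as deprecated, so treat this as an illustration of the technique and not as a recommendation.

```javascript
const domain = require('domain');

// Hypothetical wrapper: runs `handler` inside a fresh domain that owns req and res.
function withRequestDomain(handler, onError) {
  return function (req, res) {
    const d = domain.create();
    // req and res were created before this domain existed, so bind them explicitly.
    d.add(req);
    d.add(res);
    // Errors emitted by bound emitters that have no 'error' listener end up here.
    d.on('error', (err) => onError(err, req, res));
    d.run(() => handler(req, res));
  };
}
```

Emitters that are created elsewhere and never added to a domain (a shared database client, for instance) are not covered, which matches the stray ECONNRESET events described above.
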
As soon as possible during startup, create a domain and enter it. Because entered domains form a stack, this will be a fallback if an error occurs at a place that isn't covered by any other domains.
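
A minimal sketch of that fallback domain, assuming it is installed once at the very top of the entry script. `installFallbackDomain` is a hypothetical name, while `domain.create()`, `enter()`, and `process.domain` come from Node's (deprecated) `domain` module.

```javascript
const domain = require('domain');

// Hypothetical helper, meant to be called once, as early as possible in startup.
function installFallbackDomain(onError) {
  const fallback = domain.create();
  fallback.on('error', onError);
  // enter() pushes the domain onto the stack without running a callback,
  // so it remains underneath any domain entered later.
  fallback.enter();
  return fallback;
}

// Usage sketch:
//   installFallbackDomain((err) => {
//     console.error('unhandled error reached fallback domain:', err);
//     process.exit(1);
//   });
```

While a more specific domain is running, `process.domain` points at that one. Once it exits, the fallback is the active domain again.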