by remram 1438 days ago
So three failures:

- The load balancer lost its connection to etcd and did not reconnect

- The load balancer had no healthy backend and did not un-advertise itself

- The load balancer did not report either of those issues to monitoring
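As a rough illustration of the behaviour these three points expect, here is a minimal sketch of one pass of a self-check loop. Everything in it (`LoadBalancerState`, `reconcile`, the `reconnect` callback, the alert strings) is hypothetical and not taken from the post or from any real load balancer.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LoadBalancerState:
    """Hypothetical snapshot of what one load balancer node knows about itself."""
    config_store_connected: bool   # e.g. the etcd connection
    healthy_backends: int
    advertised: bool = True        # whether the node is announcing itself for traffic
    alerts: List[str] = field(default_factory=list)


def reconcile(state: LoadBalancerState, reconnect: Callable[[], bool]) -> LoadBalancerState:
    """Run one pass of the self-check and return the updated state."""
    # 1. Connection to the config store is lost: try to reconnect,
    #    and report to monitoring if the attempt fails.
    if not state.config_store_connected:
        state.config_store_connected = reconnect()
        if not state.config_store_connected:
            state.alerts.append("config store unreachable")

    # 2. Assumption for this sketch: advertise only while at least one
    #    backend is healthy, so traffic can go to another node otherwise.
    if state.healthy_backends == 0:
        state.advertised = False
        # 3. Withdrawing is itself reported to monitoring.
        state.alerts.append("no healthy backends; advertisement withdrawn")
    else:
        state.advertised = True
    return state
```

Calling `reconcile(LoadBalancerState(False, 0), lambda: False)` leaves the node un-advertised with two alerts recorded, which is the combined-failure case described in the post.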

Honestly this is a little concerning. Are they using their own load-balancing software? If yes, why?

3 comments

The system does have mitigations against the first two failures in isolation (as described in the post). Unfortunately, the mitigations did not work correctly in this scenario, where the failures were combined. This is obviously inexcusable, and we need to do better in the future.

To your final question: yes, we are using our own load balancing software. We are building a global hosting platform that needs to be able to run on bare metal servers, not an end user application where load balancing is an afterthought. As such, we cannot use much of the software that a "regular" SaaS application may be able to. Some constraints our system needs to satisfy:

- Our load balancers handle routing to hundreds of thousands of unique deployments (services), all of which need to be accessible and routable within milliseconds of a request coming in.

- We need to terminate TLS connections for thousands of unique domains.

- We need to be able to carefully control TLS handshakes, so that we can prewarm downstream services for an imminent request for a given deployment based on the SNI in the TLS client hello, before an HTTP request has even been received.

- The system needs to handle hundreds of millions of hourly requests.

- The system needs to be able to run on bare metal.

- We currently handle 34 regions globally (up from 28 at the start of the year), which means that all of the data needed to fulfill the above requirements needs to be accessible from all of our PoPs in a matter of milliseconds.
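To make the SNI prewarming constraint concrete, here is a minimal sketch using the `sni_callback` hook of Python's standard `ssl` module, which runs during the handshake once the client hello's server name is known. This is not how the system described here is implemented; the `DEPLOYMENTS` table, the `prewarm` function, and the hostnames are hypothetical stand-ins.

```python
import ssl
from typing import Dict, List, Optional

# Hypothetical routing table: hostname from the SNI -> deployment ID.
DEPLOYMENTS: Dict[str, str] = {"app.example.com": "deployment-123"}

# Records which deployments were asked to warm up (stand-in for a real signal).
prewarmed: List[str] = []


def prewarm(deployment_id: str) -> None:
    """Hypothetical hook: tell a downstream service a request is imminent."""
    prewarmed.append(deployment_id)


def on_sni(sock, server_name: Optional[str], ctx) -> None:
    """Called by the ssl module during the handshake, before any HTTP data."""
    if server_name is None:
        return None  # client sent no SNI; nothing to prewarm
    deployment_id = DEPLOYMENTS.get(server_name.lower())
    if deployment_id is not None:
        prewarm(deployment_id)
    return None  # returning None lets the handshake continue


# Server-side TLS context with the callback attached. Certificates are
# omitted here; a real server would load them before accepting connections.
context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
context.sni_callback = on_sni
```

The point of the sketch is only the ordering: the deployment lookup happens from the server name in the handshake, so the downstream service can start warming before the first HTTP request arrives.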

For many companies global load balancing is something they can outsource to AWS, GCP, or Cloudflare. For us, this is core "business logic" that we need to have full control over. It's difficult for us to outsource, and it's questionable whether it would be wise for us to do so. Building new systems is obviously always a complex undertaking, and there will be some stumbling blocks along the way, but they can be overcome. We are still bullish that our path is the right one, even if we still have a lot of work ahead.

(if this seems interesting, and you want to work with us on building load balancers, among other things: https://deno.com/jobs)

I've helped build and run similar distributed systems with deep load balancing and network interactions. There are some packages out there that do bits and pieces of the problem. I don't know of anything COTS that is a suitable like-for-like replacement of the core components, much less a suitable system. On top of that, as Luca mentioned, you almost always get into deep interactions between L3/4/5/7 and end up building bespoke logic that's tailored to the business or application needs. A trivial example would be the coupling from IP address assignment/announcement, to TLS cert, to SNI headers, to active customers, to application instance routing.
instead of the AWS ones?
Yes, or even a more turn-key software package. It sounds like they had very custom software; I would expect that established load-balancing software doesn't fail to reconnect.