by donavanm 1441 days ago
Hey Luca, some thoughts from working on similar systems.

Visibility is the cause & lesson learned on duration. It's worth simply paying for 3P distributed RUM. Make sure you can get down to /24s & ASNs as well as breaking it out by (your) target destination/address. I really liked TurboBytes in the past. Cedexis was ok, but I remember the API/raw data access being a bit of a pain.
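As a rough sketch of the /24 breakdown: the record shape below (client IP, target, success flag) is my assumption, not any vendor's actual export format, and the ASN side is left out because it needs an external IP-to-ASN dataset.

```python
import ipaddress
from collections import defaultdict


def failure_rate_by_prefix(samples, prefix_len=24):
    """Group RUM samples by (client prefix, target) and return failure rates.

    samples: iterable of (client_ip, target, ok) tuples. This shape is
    hypothetical; adapt it to whatever your RUM provider exports.
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [failures, total]
    for client_ip, target, ok in samples:
        # strict=False masks off the host bits, e.g. 203.0.113.5 -> 203.0.113.0/24
        net = ipaddress.ip_network(f"{client_ip}/{prefix_len}", strict=False)
        key = (str(net), target)
        totals[key][1] += 1
        if not ok:
            totals[key][0] += 1
    return {key: fails / total for key, (fails, total) in totals.items()}


samples = [
    ("203.0.113.5", "lb-a", True),
    ("203.0.113.77", "lb-a", False),
    ("198.51.100.9", "lb-a", True),
]
rates = failure_rate_by_prefix(samples)
```

The default prefix length of 24 only makes sense for IPv4 clients; v6 would need its own choice of prefix.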

It sounds like your TCP LB wasn't exporting metrics this time. For other cases you can get decent data out of the TCP metrics cache on Linux. And /proc has some good counters even before you get a socket; PAWSPASSIVEREJECTED may have bitten me before :( Make sure your reads of /proc/net/netstat are aligned to the right size if you go that route.
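A minimal sketch of pulling counters out of /proc/net/netstat, assuming the usual layout of paired lines (a header line of names, then a line of values, each prefixed with the group such as "TcpExt:"). It reads the whole file in one go and matches values to names rather than positions. The sample text and counter names below are illustrative only; check what your kernel actually exposes.

```python
def parse_netstat(text):
    """Parse /proc/net/netstat-style text into {"Group.Counter": value}."""
    counters = {}
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Lines come in pairs: names first, then the matching values.
    for header, values in zip(lines[0::2], lines[1::2]):
        group, names = header.split(":", 1)
        value_group, vals = values.split(":", 1)
        if group != value_group:
            raise ValueError(f"mismatched line pair: {group!r} vs {value_group!r}")
        for name, val in zip(names.split(), vals.split()):
            counters[f"{group}.{name}"] = int(val)
    return counters


def read_netstat(path="/proc/net/netstat"):
    """Read the file in a single call so names and values come from one snapshot."""
    with open(path) as f:
        return parse_netstat(f.read())


# Illustrative sample, not real output.
SAMPLE = (
    "TcpExt: SyncookiesSent PAWSPassive\n"
    "TcpExt: 3 17\n"
    "IpExt: InOctets\n"
    "IpExt: 1000\n"
)
parsed = parse_netstat(SAMPLE)
```

Diffing two snapshots a few seconds apart is usually more useful than the absolute values, since these are counters since boot.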

> ... because the load balancer that failed was very early in the network stack (a TCP load balancer). It does not record any diagnostics about dropped connections ...

You may be able to get some improved visibility with something like netflow/sflow. This aligns well with discrete components and independent failure domains as well.

> Services announce themselves to the etcd cluster when their availability state changes ... If there are no healthy backends it will un-advertise itself from the network to prevent requests ending up at this "dead end".

In my experience you really can't rely on nodes to manage themselves when it comes to service availability or health. There are too many grey failure cases where a dataplane node will partially fail enough to keep mangling traffic or passing shallow health checks. e.g. a disk going read-only or stalled IO can leave the LB and active in-memory data up, with signalling like BGP sessions staying up, but prevent the node from consuming new system/customer state updates. A separate system/component is necessary for the control loop to be insulated from those failures.

You end up in a situation where the distributed LB has "data plane" workers that handle connections & packets while the out of band "control plane" determines health & controls BGP/routing/ARP/whatever to put the data plane nodes in or out of service. Your application/lb/etc data plane can still self report & retrieve data from etcd. But put the control somewhere with less correlated failures. While you're at it, build data versioning into your configuration, e.g. active customers/domains/etc, that your dataplane uses & reports. That way your control plane can check both the availability/performance and the current working state of the LB dataplane configuration.
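To make the versioning idea concrete, here is a hedged sketch of the control plane's decision. NodeReport, nodes_in_service and max_lag are names I made up for illustration; the real inputs would be an out-of-band probe result plus whatever version your data plane reports.

```python
from dataclasses import dataclass


@dataclass
class NodeReport:
    node: str
    probe_ok: bool       # result of an external probe, not the node's own opinion
    config_version: int  # config version the data plane reports it is running


def nodes_in_service(reports, wanted_version, max_lag=1):
    """Keep a node in service only if it passes the probe AND its config is fresh.

    max_lag is a hypothetical tolerance: how many versions behind a node may
    be before the control plane pulls it (e.g. withdraws its route).
    """
    in_service = []
    for r in reports:
        fresh = wanted_version - r.config_version <= max_lag
        if r.probe_ok and fresh:
            in_service.append(r.node)
    return in_service


reports = [
    NodeReport("lb-1", probe_ok=True, config_version=10),
    NodeReport("lb-2", probe_ok=True, config_version=7),   # up, but stuck on stale state
    NodeReport("lb-3", probe_ok=False, config_version=10),
]
active = nodes_in_service(reports, wanted_version=10)
```

lb-2 is the grey failure case: it passes the shallow check but has stopped consuming updates, so the version check is what catches it.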

> [The LB did not have] any healthy backends to direct traffic to. ... This caused the traffic to be dropped entirely.

Throwing a RST or similar here is not wrong per se, and is a nice clear failure mode. One other approach is to have something like a default route that you can punt traffic to (and alert) as a last resort. It depends on your network/LB configuration but this could be a common MAC address, an internal ECMP'd route, or similar. I think you'll see many services that build L3/4 LBs, like CDNs, take this approach. IIRC Google Maglev and Fastly document their take on this to deal with problems like IP fragments and MTU discovery where some packets don't flow with the rest of the 5 tuple.
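A toy sketch of the "punt to a default" behaviour, assuming a flow is identified by its 5 tuple. pick_next_hop and the fallback name are hypothetical, and the plain hash-modulo selection here is a stand-in, not the consistent hashing that something like Maglev actually uses.

```python
import hashlib


def pick_next_hop(flow, healthy_backends, fallback):
    """Return (next_hop, used_fallback) for a flow.

    flow: (src_ip, src_port, dst_ip, dst_port, proto).
    With no healthy backends, punt to the fallback instead of dropping;
    the caller should alert whenever used_fallback is True.
    """
    if not healthy_backends:
        return fallback, True
    # Stable hash so the same flow maps to the same backend across calls.
    digest = hashlib.sha256(repr(flow).encode()).digest()
    idx = int.from_bytes(digest[:8], "big") % len(healthy_backends)
    return sorted(healthy_backends)[idx], False


flow = ("203.0.113.5", 51234, "192.0.2.10", 443, "tcp")
hop, used_fallback = pick_next_hop(flow, [], fallback="default-pool")
```

With plain modulo, adding or removing a backend reshuffles most flows, which is one reason real L4 LBs reach for consistent hashing instead.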

> The region will remain disabled until our monitoring has improved and the issue has been fixed more permanently.

I understand if this choice is around business & customer confidence. However I didn't see anything that indicated your failure modes were specific to us-west3. It seemed to be that visibility & detection were the real failure. And in that case I'd posit the better path is getting global visibility into your failure mode, deploying that first/early to us-west3 and using that as your gate.

edit: I'm a couple years past doing distributed networking/lb systems as my full time job, so apologies if this is dated/fuzzy advice.