How we found a bug in Amazon ELB (sysdig.com)
151 points by davideschiera 3703 days ago
14 comments

Great article. The Sysdig team really knows how to root cause tough problems. The Sysdig tools can be invaluable for getting and making sense of low level data.

If you want to play with ELBs, rolling deploys, connection draining to ECS containers, I humbly submit the open source Convox project I am working on.

https://github.com/convox/rack

It sets up a peer reviewed, production tested batteries-included VPC, ECS, ASG, ELB, etc cluster in minutes.

If the conclusion of this Sysdig post was that you always need to run 2 instances per AZ for the best reliability, I would strongly consider adding that knowledge into the tools either as a default or a production check.

Since it sounds like an ELB bug I'll keep the 3 instances in 3 AZs default.

Do you have any thoughts on how to scale load balancers horizontally and on demand? I've played briefly with attempting some dynamic DNS routing based on health checks to re-route traffic from balancers that have been shut down due to low traffic, but DNS really isn't designed to work this way.
I'm not clear what you're asking... Do you mean auto scaling the EC2 instances in the load balancer? Or auto scaling the # of load balancers? Or something else?

Of course the former is very common with Auto Scaling Groups [1] [2]. Then you can use round robin or session sticky routing algorithms in the load balancers.

(Apologies if I'm totally off-base for what you were asking.)

1: http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide...

2: http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide...

I meant the latter; if your load balancers are overwhelmed how do you scale them? Further to that point, is it possible to create an architecture where load balancers are responsive and can spin up in response to traffic? If you have to deal with loads that are prone to bursts, you need to allocate those load balancers in such a way that they can handle the worst case scenario.
I haven't worked at the scale where dynamically scaling the load balancers themselves is the bottleneck. I think you pose an interesting question, and I'm hoping someone with more knowledge can comment on that.
I don't believe there is any way to scale out the ELB from the user side; you can contact AWS support to 'pre-warm' ELBs for high-traffic sites before cutting DNS over to them.

http://aws.amazon.com/articles/1636185810492479#pre-warming

DNS-based load balancing has a lot of caveats due to how clients can hold onto stale information. To do better means stepping up to running anycast via BGP, a significant jump in complexity. That's why services like Cloudflare exist.
As a network engineer, I'm constantly having to prove that "it's not the network" so I love reading others' technical analyses of similar things. Great troubleshooting and technical detail in this write-up.
What are some examples where you proved this? Curious about scenarios...
When we ran into this exact same issue, we were told ELBs are explicitly not designed for long-running connections. So know that you will always be working around this design constraint if you run long-running connections through ELBs.

There's another case that the article doesn't really discuss (though the evidence of it is in the beginning when all connections drop simultaneously) where the ELB nodes themselves scale vertically at a particular threshold. I believe the setup described is still vulnerable to those scaling events.

We definitely observed such drops, which we attributed to presumably internal ELB scaling activity, but they happen so rarely that for the moment they haven't been a real issue, as opposed to the one described in the article, which happened consistently at every deployment in our test environment.
Yeah, we've decided to live with the internal ELB scaling risks for the moment as well. We had the exact same situation where a deployment without gradual connection draining (even if we kept an instance in service in every AZ) would cause the ELBs to scale and drop all of our connections every time once we were at a certain scale. Definitely caused us a fair amount of confusion as it would happen minutes after the deploy when everything seemed to be calmed down again.
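A minimal sketch of what "gradual" draining could look like, assuming the application can close client connections itself in small batches rather than all at once. The function name, batch size, and interval are hypothetical and are not taken from the article or from any AWS API.

```python
def drain_schedule(client_ids, batch_size, interval_s):
    """Yield (delay_seconds, batch) pairs that spread disconnects over time.

    client_ids: list of connection identifiers (hypothetical).
    batch_size: how many clients to disconnect at each step.
    interval_s: seconds to wait between consecutive batches.
    """
    for start in range(0, len(client_ids), batch_size):
        delay = (start // batch_size) * interval_s
        yield delay, client_ids[start:start + batch_size]


# Five hypothetical clients, disconnected two at a time, ten seconds apart.
for delay, batch in drain_schedule(["c1", "c2", "c3", "c4", "c5"], 2, 10):
    print(f"t+{delay}s: disconnect {batch}")
```

Spreading the disconnects means the reconnecting clients arrive as a trickle rather than a spike, which is the property the comment above is relying on.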
The author said he needed at least 2 instances in an AZ to avoid the bug, and used that as his workaround in the meantime while Amazon works on the bug.
Really interesting. Only two weeks ago we were told by an AWS Architect that "if you need persistent TCP connections to servers avoid the ELB and connect straight with your [scaled] EC2 instances". This was for a higher load scenario though.
In general, if you are using ELBs you should have at least 2 instances per AZ or cross zone load balancing enabled. I've seen this get teams several times.
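As a rough illustration, that rule of thumb could be written as a pre-deploy check. This is a sketch only: the function name and inputs are hypothetical, and it works on a plain list of AZ names rather than calling any AWS API.

```python
from collections import Counter


def under_provisioned_azs(instance_azs, cross_zone_enabled=False, min_per_az=2):
    """Return the AZs that have fewer than min_per_az in-service instances.

    instance_azs: iterable of AZ names, one entry per in-service instance.
    If cross-zone load balancing is enabled, the rule of thumb is treated
    as satisfied and nothing is flagged.
    """
    if cross_zone_enabled:
        return []
    counts = Counter(instance_azs)
    return sorted(az for az, n in counts.items() if n < min_per_az)


# Hypothetical fleet: one instance in each of three AZs.
fleet = ["us-east-1a", "us-east-1b", "us-east-1c"]
print(under_provisioned_azs(fleet))                           # flags all three AZs
print(under_provisioned_azs(fleet, cross_zone_enabled=True))  # flags nothing
```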

The other thing to consider when deploying to the cloud with load balancers is to use an immutable architecture. Taking hosts out of service, updating them, and putting them back in service is a bit cumbersome at best and leaves you vulnerable to service outages.

While I agree that an immutable architecture is preferable, in some cases it's not viable. In one of our projects we re-use the instances like in the article, since we deploy multiple times an hour. In AWS you are billed for each started hour, which in this case would mean paying a lot extra if we created new instances for each deploy.
I do wish AWS had more granular EC2 billing, and I expect that to come soon since GCE offers it. But 2 things:

1) If you are at the scale of deploying several times an hour, the instance hour cost would probably look like a rounding error for your entire AWS spend, I'd imagine.

2) At that cadence you'll definitely benefit from using containers and a container scheduler (Kube, ECS, etc). Reuse the infrastructure but redeploy your apps to your heart's content.

Is Elastic Beanstalk not an option? It doesn't replace hosts on redeploy, so you wouldn't end up cycling through unnecessary instances.
Genuine question: Why is it okay to reuse instances because it's controlled via an abstraction layer, as opposed to doing it yourself?
I agree, that doesn't make a difference.

I only mentioned EB because it does that kind of thing for you and if you don't have a highly complicated setup it makes rolling updates without changing instances very easy.

I've heard that cross zone load balancing means the vpc encryption does not cover traffic between zones (the traffic is isolated like in ec2 classic). Is that substantiated?
Network communication between instances in a VPC is not encrypted, and never has been, to my knowledge. Perhaps you're thinking of VPN?
We recently discovered that the NAT Gateway also terminates connections by issuing a RST packet when it receives the next packet for a connection that it believes to have timed out, effectively causing the new request to fail. The previous recommended approach of NATing in VPC was to use NAT instances, which sent FIN packets when the timeout was hit, cleanly closing the connection. That behavior was far better, since it indicated that a new request should re-connect first.

AWS Support indicated that this was a feature of the new NAT Gateways, even though it breaks outbound connections made by popular implementations such as the urllib3 connection pools used by the Python Requests library. This is pretty unfortunate, and has been a roadblock in migrating to the NAT Gateways.
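One possible client-side workaround, sketched under assumptions: treat a reset on a pooled connection as a signal to reconnect and retry once. The `connect` and `request` callables are hypothetical stand-ins for whatever the client uses, and the sketch catches Python's built-in `ConnectionResetError`; an HTTP library may wrap that error in its own exception type.

```python
def call_with_reconnect(connect, request, retries=1):
    """Run request(conn); on a connection reset, reconnect and try again.

    connect: zero-argument callable returning a fresh connection object.
    request: callable taking that connection and returning a result.
    Only appropriate for idempotent requests, because the first attempt
    may have partially reached the server before the reset.
    """
    conn = connect()
    for attempt in range(retries + 1):
        try:
            return request(conn)
        except ConnectionResetError:
            if attempt == retries:
                raise  # out of retries; surface the reset to the caller
            conn = connect()  # the old connection is dead; open a new one
```

Keeping connections from going idle in the first place (for example with keepalives or a pool lifetime shorter than the gateway's timeout) is another option; this sketch only covers the retry side.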

Full disclosure: I'm an engineer at AWS and I work on NAT Gateway :)

Thanks for the pointer to urllib3 - we'll take a look at it and see if there's anything we can do about the behavior. One of the challenges with sending "FIN" on timeout is, as you write ... it closes the connections cleanly.

Some TCP based protocols (Including even HTTP in some modes) use a successful connection close to indicate that an object has been transferred fully; so what we've seen is that a network connection may stall (internet packet loss for example) ... then the connection eventually times out ... and the "FIN" falsely conveys that the entire object has been transferred. The end result is a truncated object, which is no good either.
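To make the truncation problem concrete, here is a small sketch, assuming the receiver may or may not know the expected length up front. The function name is hypothetical. When the sender declared a length, a short body can be detected after a clean close; when it did not, a clean close is the only end-of-object signal and a truncated body looks the same as a complete one.

```python
def transfer_status(body, declared_length=None):
    """Classify a body received on a connection that was closed cleanly.

    body: the bytes actually received.
    declared_length: the length the sender announced up front, or None if
    the protocol relies on connection close to mark the end of the object.
    """
    if declared_length is None:
        return "unknown"      # close was the only signal; cannot verify
    if len(body) < declared_length:
        return "truncated"    # clean close, but bytes are missing
    return "complete"


print(transfer_status(b"hello", declared_length=11))
print(transfer_status(b"hello"))
```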

Thanks for the explanation colmmacc! I agree with the challenge you described, and am not sure what the best approach would be. Perhaps a configurable timeout such as ELBs have?
Somewhat unrelated to the ELB problem identified, but an alternative solution to the original deployment problem: assuming that the collectors are stateless (they seem to be), start off the deployment by spinning up a new collector with the new code installed. Then, proceed with the deployment in the original fashion. Once that's over, kill the extra collector. This will ensure that load is distributed roughly in the same manner, over the same number of nodes, during the deployment as before the deployment. Depending on the load caused by initiating a connection, more than one extra node may be utilized. In any case, this is a much simpler approach than baking in application-level connection termination. All for a few extra bucks per deploy and a small amount of engineering time up front.
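A small simulation of that idea, assuming nodes are updated one at a time and a node serves no traffic while it updates. The function and its `surge` parameter are hypothetical names used for illustration.

```python
def rolling_deploy_capacity(nodes, surge=1):
    """Return the number of serving nodes at each step of a rolling deploy.

    Surge nodes with the new code start first, then each original node is
    taken out, updated, and put back, and finally the surge nodes are killed.
    """
    steps = [nodes]                      # before the deploy
    steps.append(nodes + surge)          # extra node(s) come up
    for _ in range(nodes):
        steps.append(nodes + surge - 1)  # one original node is out for its update
        steps.append(nodes + surge)      # it rejoins with the new code
    steps.append(nodes)                  # extra node(s) are killed
    return steps


print(min(rolling_deploy_capacity(3, surge=1)))  # capacity never drops below 3
print(min(rolling_deploy_capacity(3, surge=0)))  # without the extra node it dips to 2
```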
Definitely a feasible approach. Let's just say that the reality has a bit more color and we have some other practical advantages in controlling the exact moment when we disconnect a particular client :)
I don't really see a benefit in updating existing instances in this manner. Launching replacement instances with the new code is much easier for us, and it also provides a super fast means of rollback.
Both approaches are reasonable (and there's also a third one, ship your application in containers and replace containers instead of instances).

We update existing instances because in our test environment we deploy at every single new commit (we absolutely love that), and we have hundreds (or more) a day. At that pace, replacing instances would be more time consuming (again, for our specific use case) and less cost efficient.

Plus, updating existing instances is handled automatically by AWS CodeDeploy, which provides a very good deployment pipeline that you can control using the aws cli tool.

There are other minor advantages but those are the two main ones.

That's an awesomely aggressive deployment rate and a great reason to do instance mutation.

Does something verify every commit in the testing environment too?

Yes, every commit gets pulled by Jenkins, which builds the whole thing, runs unit tests and then starts the deployment once the tests pass.
It's really hard to replace the simplicity and reliability of letting ASG (and generally CloudFormation) roll out new instances.

However best practices always evolve...

I'd say that rolling out containers on ECS is starting to really show advantages.

It is now generally:

- easier to build and push an image than burn an AMI
- faster to boot a container than an instance
- faster to finish a deploy with options like min containers in service and a slack instance or two

To be honest most teams don't actually need the extra agility that containers promise.

But if I was starting an AWS setup from scratch I'd strongly consider containers on ECS.

In addition to the speed there is more portability with containers and a whole new generation of tools coming in the ecosystem.

One potential issue with spinning up new instances on EC2 is that for larger instance sizes, if you care about the instances being in a certain AZ, there may not be enough available to do this.
We experienced something like this a long while ago, something like 4-5 years. We still employ our workaround, which is to have a tiny "keepalive" instance in each AZ in the ELB.
When the cloud works as desired, life is grand.

But when it doesn't, debugging might actually be simpler with fewer black boxes between you and the metal.

Hmm, this seems like a pretty big bug in connection draining. I feel like one instance per AZ is a pretty common scenario. Great article!
To be fair, the scenario is less common because it happens only when the drained connections are terminated in a certain pattern (as shown in the charts). Still, it is definitely common enough that it can be easily replicated and cause real trouble :)
This is a great article to read.

The author mentions WireShark - fun fact: the founder of Sysdig, Loris, is also the creator of WireShark.

He created WinPcap, not Wireshark.
Nice article.
Great work done!
Wonderful debugging of the issue.