by NathanKP 2124 days ago
Yep you are right about SIGTERM, but let's think back to the original reason why we wanted to update the node: because of a patch, probably a security patch for a CVE?

What is the better option here? Implement a SIGTERM based process that allows the user to block the patch for a critical, possibly zero-day CVE for xx hours, remaining in a vulnerable state the entire time? Or implement a system that just patches the underlying host without interrupting the workloads on the box?

You aren't wrong: what you described is a possibility, but it is not the best one.
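For concreteness, here is a rough sketch of the SIGTERM-based option being discussed. The `GracefulWorker` class and its `grace_seconds` parameter are made up for illustration and are not any AWS or Fargate API. The point is that the grace window is exactly how long the host stays unpatched.

```python
import signal
import time


class GracefulWorker:
    """Hypothetical worker that delays shutdown after SIGTERM (illustrative only)."""

    def __init__(self, grace_seconds):
        # Assumed knob: how long the tenant may keep running after SIGTERM.
        self.grace_seconds = grace_seconds
        self.deadline = None

    def install(self):
        # Register the handler for SIGTERM in the current process.
        signal.signal(signal.SIGTERM, self.on_sigterm)

    def on_sigterm(self, signum, frame):
        # Do not exit right away; just record the latest time we may keep running.
        if self.deadline is None:
            self.deadline = time.monotonic() + self.grace_seconds

    def should_stop(self, work_remaining):
        # Before SIGTERM arrives, keep going.
        if self.deadline is None:
            return False
        # After SIGTERM, stop once the work is done or the grace window has expired.
        return (not work_remaining) or time.monotonic() >= self.deadline
```

A main loop would call `should_stop(...)` between units of work and exit cleanly when it returns `True`.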


If there's a CVE that is being actively exploited on your network, you should preempt running processes to deal with it, and you absolutely must take the boot+nuke approach, because the exploit could already be affecting any host that has not yet been boot+nuked.

If there's not a CVE, AWS can manage the lifecycle of their machines themselves and keep ~5% of all of their machines "unschedulable" at any one time, waiting for existing processes to complete so that each machine gets an orderly restart before the boot+nuke. An SLA of "Tasks may never run longer than X days" (X = 10-30) allows them to perform orderly restarts.
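To make the rolling approach concrete, here is a minimal sketch. The `Host` type, the function names, and the 5% default are all my own assumptions for illustration; this is not how AWS actually does it. The idea is to cap how many hosts are draining at once and only recycle a host after its tasks have finished.

```python
from dataclasses import dataclass


@dataclass
class Host:
    # Hypothetical host record, for illustration only.
    name: str
    running_tasks: int = 0
    draining: bool = False   # unschedulable, waiting for tasks to finish
    patched: bool = False


def pick_hosts_to_drain(hosts, max_unschedulable_fraction=0.05):
    """Mark unpatched hosts as draining without exceeding the unschedulable budget."""
    budget = max(1, int(len(hosts) * max_unschedulable_fraction))
    draining_count = sum(1 for h in hosts if h.draining)
    for h in hosts:
        if draining_count >= budget:
            break
        if not h.patched and not h.draining:
            h.draining = True
            draining_count += 1
    return [h for h in hosts if h.draining]


def recycle_idle_hosts(hosts):
    """Recycle (here: just flag as patched) draining hosts with no running tasks."""
    recycled = []
    for h in hosts:
        if h.draining and h.running_tasks == 0:
            h.patched = True
            h.draining = False
            recycled.append(h.name)
    return recycled
```

Run in a loop, this works through the fleet a few hosts at a time. The maximum task lifetime in the SLA is what bounds how long any one host can sit in the draining state.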

I don't know your background but the way you respond makes me think you have not been responsible for systems that multiple tenants rely on for varying workloads.

These assumptions you're making are dangerous because the variety of workloads across tenants is extreme. If you're going to do something like "kill compute no matter what" then you better have a good reason for it.

You may want to look at my resume. I've seen what happens when you don't "kill compute no matter what": when compute does get killed anyway (hardware problems happen quite often at scale), you have problems. I've also seen it done right. Clearly, Fargate has not done it right, and I could also tell you that from having used the service.