Hacker News new | ask | show | jobs
by IMTDb 2124 days ago
This stuff is probably waaaay over my head, but isn't that why SIGTERM was made for ? To notify a running process that the host needs to be shutdown/restarted and to let the running process finish it's current task (current frame encoding / current multiplayer game / current request / ...) and that the state / cache / progress / ... needs to be saved.

The process on aws side would then be : send SIGTERM to all workloads. wait for [configurable] amount of time (maxed at xx hours) or until all workloads have exited (whichever comes first). Shutdown the node. Update the node. Start the node. Restart the workloads.

1 comments

Yep you are right about SIGTERM, but let's think back to the original reason why we wanted to update the node: because of a patch, probably a security patch for a CVE?

What is the better option here? Implement a SIGTERM based process that allows the user to block the patch for a critical, possibly zero-day CVE for xx hours, remaining in a vulnerable state the entire time? Or implement a system that just patches the underlying host without interrupting the workloads on the box?

You aren't wrong, what you described is a possibility, but it is not the best possibility.

If there's a CVE vulnerability that is being actively exploited on your network, you should preempt running processes to deal with it, and absolutely must take the boot+nuke approach, because it already could be affecting any host that has not already been boot+nuked?

If there's not a CVE, AWS can significantly manage the lifecycle of their machines, and have ~5% of all of their machines "unschedulable" at any one time, waiting for existing processes to complete so that they may use an orderly restart before doing a boot+nuke. A SLA of "Tasks may never run longer than X days"(x=10-30) allows them to perform orderly restarts.

I don't know your background but the way you respond makes me think you have not been responsible for systems that multiple tenants rely on for varying workloads.

These assumptions you're making are dangerous because the variety of workloads across tenants is extreme. If you're going to do something like "kill compute no matter what" then you better have a good reason for it.

You may want to look at my resume. I've seen what happens when you don't "kill compute no matter what" - When compute does get killed no matter what (hardware problems happen quite often at scale), you have problems. I've also seen it done right. Clearly, Fargate has not - I could also tell you that from having used the service.