by NathanKP 2124 days ago
There are any number of reasons to avoid restarting things. Some customers are running code that has a cold start and needs some time to warm up its cache if it restarts. Some customers are running jobs (video rendering, machine learning training, etc.) that might take literally days to complete. Interrupting these jobs and causing them to restart wastes the customer's time and causes them to lose progress. Other containers may be hosting multiplayer game servers, and forcing them to restart would cause everyone logged into the game instance to get disconnected or otherwise dropped from their game.

All of the above are use cases that AWS Fargate is used for. Beyond this, many folks simply don't like it when things happen unexpectedly outside of their control. We have Fargate Spot for workloads that can tolerate interruption, and we discount the price if you choose this launch strategy. However, Fargate on-demand seeks to avoid interrupting your containers. You are in control of when your containers start and stop or autoscale.

3 comments

This makes a ton of sense and I appreciate the response. I think what people aren't recognizing is that cloud services make you pay for performance, so doing things like relaunching containers which have slow warmup time literally costs extra money. While it's certainly important to design systems such that the containers can be tossed aside easily, that doesn't mean there isn't value in reducing how often that tossing aside occurs.
Forgive my hijack.

Any plans to reduce the minimum bill time for Fargate to accommodate short tasks?

With a 1 minute minimum billing period, you have to turn to Lambda for very short tasks, or keep a long-running Fargate task consuming work from some message bus.
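To put a number on why the minimum hurts short tasks, here is a rough sketch of the arithmetic. It assumes per-second billing rounded up, with the 1 minute floor mentioned above; the function names are made up for illustration and no real prices are used.

```python
import math

# The 1 minute minimum discussed above, expressed in seconds.
MINIMUM_BILLED_SECONDS = 60


def billed_seconds(runtime_seconds):
    """Seconds you pay for, assuming per-second billing with a 60 s floor."""
    return max(MINIMUM_BILLED_SECONDS, math.ceil(runtime_seconds))


def overpay_factor(runtime_seconds):
    """How many times more you pay than the task actually ran."""
    return billed_seconds(runtime_seconds) / runtime_seconds


if __name__ == "__main__":
    for runtime in (5, 30, 60, 120):
        print(runtime, billed_seconds(runtime), overpay_factor(runtime))
```

Under these assumptions a 5 second task is billed as if it ran for a full minute, so anything much shorter than 60 seconds pays mostly for time it did not use.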

If you choose Lambda, your containers don't work, so you need to rebuild your runtime with Lambda layers or ebs, or squeeze into the Lambda env.

If you choose messaging, say SQS fed from a Lambda called by API Gateway, you've complicated your architecture, and your Fargate instance is potentially hanging out idle, billing while it waits for messages.

Fargate spot removed the last reason to consider AWS Batch. Short tasks could largely replace lambda.

It would be nice to Fargate all the things.

This stuff is probably waaaay over my head, but isn't that what SIGTERM was made for? To notify a running process that the host needs to be shut down/restarted, to let the running process finish its current task (current frame encoding / current multiplayer game / current request / ...), and to signal that the state / cache / progress / ... needs to be saved.
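For anyone who hasn't written one, a workload-side SIGTERM handler along these lines might look roughly like the sketch below. The `GracefulWorker` class, its doubling "work", and the in-memory checkpoint are all hypothetical stand-ins; a real job would persist its progress somewhere durable.

```python
import signal


class GracefulWorker:
    """Processes items one at a time and stops cleanly once SIGTERM arrives."""

    def __init__(self, items):
        self.pending = list(items)
        self.done = []
        self.checkpoint = None
        self.stop_requested = False

    def request_stop(self, signum, frame):
        # Only set a flag here; the main loop decides when it is safe to stop.
        self.stop_requested = True

    def process(self, item):
        # Placeholder for real work (one frame, one request, one game tick...).
        return item * 2

    def run(self):
        # Finish the current item, then check whether a stop was requested.
        while self.pending and not self.stop_requested:
            item = self.pending.pop(0)
            self.done.append(self.process(item))
        # Save progress so a restarted worker could resume from here.
        self.checkpoint = {"done": list(self.done), "remaining": list(self.pending)}
        return self.checkpoint


if __name__ == "__main__":
    worker = GracefulWorker(range(10))
    # Register the handler from the main thread before starting work.
    signal.signal(signal.SIGTERM, worker.request_stop)
    print(worker.run())
```

The handler never interrupts an item halfway through; it just asks the loop to stop at the next safe point, which is the "finish the current task, then save state" behaviour described above.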

The process on the AWS side would then be: send SIGTERM to all workloads. Wait for a [configurable] amount of time (maxed at xx hours) or until all workloads have exited, whichever comes first. Shut down the node. Update the node. Start the node. Restart the workloads.
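The host-side half of that procedure (signal, wait up to a deadline, then stop whatever is left) could be sketched like this. It is only an illustration using local child processes on a POSIX system, not how AWS does anything; `drain_workloads` and its parameters are hypothetical names, and force-killing stragglers after the deadline is an assumption about what "shut down the node" implies.

```python
import signal
import subprocess
import sys
import time


def drain_workloads(procs, grace_seconds, poll_interval=0.05):
    """Send SIGTERM to every running workload, wait up to grace_seconds,
    then SIGKILL whatever is still running. Returns the force-killed ones."""
    for proc in procs:
        if proc.poll() is None:
            proc.send_signal(signal.SIGTERM)

    # Wait until every workload has exited or the grace period runs out.
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline and any(p.poll() is None for p in procs):
        time.sleep(poll_interval)

    force_killed = []
    for proc in procs:
        if proc.poll() is None:
            proc.kill()
            force_killed.append(proc)
        proc.wait()  # reap the process so its return code is recorded
    return force_killed


if __name__ == "__main__":
    # Two stand-in workloads that would otherwise sleep for a minute.
    workloads = [
        subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
        for _ in range(2)
    ]
    stragglers = drain_workloads(workloads, grace_seconds=5)
    print("force-killed:", len(stragglers))
    # After this point the node could be updated and the workloads restarted.
```

The grace period is the knob the rest of this thread argues about: the longer it is, the longer the host stays unpatched.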

Yep, you are right about SIGTERM, but let's think back to the original reason why we wanted to update the node: because of a patch, probably a security patch for a CVE.

What is the better option here? Implement a SIGTERM-based process that allows the user to block the patch for a critical, possibly zero-day CVE for xx hours, remaining in a vulnerable state the entire time? Or implement a system that just patches the underlying host without interrupting the workloads on the box?

You aren't wrong, what you described is a possibility, but it is not the best possibility.

If there's a CVE that is being actively exploited on your network, you should preempt running processes to deal with it, and you absolutely must take the boot+nuke approach, because it could already be affecting any host that has not yet been boot+nuked.

If there's not a CVE, AWS can actively manage the lifecycle of their machines and keep ~5% of them "unschedulable" at any one time, waiting for existing processes to complete so that they can do an orderly restart before a boot+nuke. An SLA of "Tasks may never run longer than X days" (X = 10-30) allows them to perform orderly restarts.

I don't know your background but the way you respond makes me think you have not been responsible for systems that multiple tenants rely on for varying workloads.

These assumptions you're making are dangerous because the variety of workloads across tenants is extreme. If you're going to do something like "kill compute no matter what" then you better have a good reason for it.

You may want to look at my resume. I've seen what happens when you don't "kill compute no matter what": when compute does get killed no matter what (hardware problems happen quite often at scale), you have problems. I've also seen it done right. Clearly, Fargate has not. I could also tell you that from having used the service.