by mononcqc
2096 days ago
Some services need gigabytes of state downloaded locally to make decisions with acceptable latency, and that state is replicated and constantly updated: ML models, large routing tables, and the like. Connections can be expected to stay active for many minutes at a time (not everything is a web server serving short HTTP requests that are easy to resume). Changing the instance means re-transferring all of that data and re-establishing all of that state, on all of the nodes. You could easily see draining connections take 15-20 minutes, and booting back and scaling up take another 15-20 minutes, if you can do it for _all_ the instances at once (which is not guaranteed; you may need to stagger things to be more cost-effective). So you start with each deploy easily taking over an hour. If you deploy 2-3 times a day and your peak times line up with those deploys, you can more than double your operating cost just to deploy, and that extra cost can take more than 4 figures to count. Some of the systems we maintained (not necessarily ones we live deployed to, but ones that still required rolling restarts) required over 5,000 instances and could not just be doubled in size without running into limits in specific regions for given instance types. If a blue/green deploy takes a couple of minutes, you probably don't have a workload where this is worth thinking about that much.
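To make the arithmetic above concrete, here is a rough back-of-the-envelope sketch. The model is an assumption, not something from the comment: it treats the old and new fleets as running side by side for the whole boot-plus-drain window, and splits the fleet into equal waves when staggering. The function name and parameters are made up for illustration.

```python
import math

def deploy_overhead(instances, drain_min, boot_min, waves=1):
    """Rough cost of one blue/green deploy under a simplified model.

    Assumption: each wave keeps its old and new instances running together
    for the full boot + drain window, and waves run one after another.
    """
    window_min = drain_min + boot_min
    # Staggering stretches the deploy in time...
    wall_clock_min = waves * window_min
    # ...but every instance is still duplicated for one window in total.
    extra_instance_hours = instances * window_min / 60
    # What staggering does buy you is a lower peak of extra capacity.
    peak_extra_instances = math.ceil(instances / waves)
    return wall_clock_min, extra_instance_hours, peak_extra_instances

# Figures taken from the upper end of the ranges in the comment,
# staggered into two waves (the wave count is a made-up example).
wall, extra, peak = deploy_overhead(5000, drain_min=20, boot_min=20, waves=2)
print(f"{wall} min per deploy, {extra:.0f} extra instance-hours, "
      f"{peak} extra instances at peak")
```

Under this model, staggering does not reduce the extra instance-hours. It only trades a longer deploy for a smaller peak, which is what matters when a region cannot hand you a second full fleet of a given instance type.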
|
Did you look at any of the node-based options like checkpointing the state to a file on the node, and loading that into your newly started pods? Or using read-many persistent volumes? (Not sure if you needed to write to the state file from every process too?)
(This doesn’t help with connections of course, that’s a bit more thorny.)
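For what it's worth, the checkpoint-to-a-file idea could look roughly like the sketch below. Everything here is hypothetical: the function names are invented, and JSON only stands in for whatever format the real state would use (gigabytes of state would want something more compact). The one real point is writing to a temp file and renaming it into place, so a newly started process never reads a half-written checkpoint.

```python
import json
import os
import tempfile

def save_checkpoint(state, path):
    """Write state to path atomically (temp file in the same dir, then rename)."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes are on disk before renaming
        os.replace(tmp_path, path)  # atomic when both paths are on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise

def load_checkpoint(path):
    """Return the saved state, or None if there is no checkpoint yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return None
```

A new process would call `load_checkpoint` at startup and fall back to the full download only when it returns `None`. It would still need to catch up on updates made since the checkpoint was written.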