by mononcqc 2096 days ago
Some services may require gigabytes of state to be downloaded to work with acceptable latency on local decisions, and that state is replicated in a way that is constantly updated. These could include ML models, large routing tables, or anything of the kind. Connections could be expected to be active for many minutes at a time (not everything is a web server serving short HTTP requests that are easy to resume), and so on.

Changing the instance means having to re-transfer all of the data and re-establish all of that state, on all of the nodes. You could easily see draining connections take 15-20 minutes, and booting back and scaling up take another 15-20 minutes, and that is if you can do it for _all_ the instances at once (which may not be a guarantee, and you could need to stagger things to be more cost-effective).

You start with each deploy easily taking over an hour. If you deploy 2-3 times a day and your peak times line up with those deploys, you can more than double your operating cost just to deploy, and that extra cost can run to more than four figures.
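To make the arithmetic concrete, here is a rough sketch using the figures above. It assumes each batch drains and then boots one after the other; the function names and the two-batch stagger are hypothetical, picked only for illustration.

```python
def rolling_deploy_minutes(drain_min, boot_min, batches=1):
    """Wall-clock minutes if each batch drains, then boots, in sequence."""
    return batches * (drain_min + boot_min)


def daily_deploy_hours(deploys_per_day, deploy_min):
    """Hours per day spent mid-deploy."""
    return deploys_per_day * deploy_min / 60


# All instances at once: 15-20 min drain plus 15-20 min boot.
best_case = rolling_deploy_minutes(15, 15)   # 30 minutes
worst_case = rolling_deploy_minutes(20, 20)  # 40 minutes

# Assumption: staggering into two batches to avoid doubling capacity.
staggered = rolling_deploy_minutes(20, 20, batches=2)  # 80 minutes

# Three deploys a day at the staggered pace.
hours_in_deploy = daily_deploy_hours(3, staggered)  # 4.0 hours
```

Under these assumptions a single staggered deploy already passes the one-hour mark, and three of them keep the fleet mid-deploy for about four hours a day.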

Some of the systems we maintained (not necessarily ones we live-deployed to, but ones that still required rolling restarts) required over 5,000 instances and could not just be doubled in size without running into limits in specific regions for given instance types.

If a blue/green deploy takes a couple of minutes, you probably don't have a workload where this is worth thinking about that much.

1 comments

This sounds like an interesting use case.

Did you look at any of the node-based options like checkpointing the state to a file on the node, and loading that into your newly started pods? Or using read-many persistent volumes? (Not sure if you needed to write to the state file from every process too?)

(This doesn’t help with connections of course, that’s a bit more thorny.)

> checkpointing the state to a file on the node, and loading that into your newly started pods

For some types of my nodes, the majority of the state was TCP connections and associated processes. I don't think there's a generally available system that is capable of transferring TCP connection state (although, I'd love to build one! if you've got a need, funding, and a flexible timetable), which would be a prerequisite to moving the process state. All of those connections need to be ended, and clients need to reconnect to another server (where they might need to do it again, if they don't get lucky and get a new server to begin with).

The other nodes with more traditional state had up to half a terabyte of state in memory, and potentially more on disk, and a good deal of writes. That's about seven minutes to transfer state on 10G Ethernet, assuming you can use the whole bandwidth and sending and receiving that data is faster than the network.
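The seven-minute figure can be checked with a back-of-the-envelope calculation. This sketch assumes decimal units (1 TB = 10^12 bytes), a fully saturated link, and no protocol overhead; the function name is made up for the example.

```python
def transfer_minutes(state_bytes, link_gbps):
    """Minutes to move state_bytes over a link of link_gbps gigabits per second."""
    bits = state_bytes * 8
    seconds = bits / (link_gbps * 1e9)
    return seconds / 60


half_terabyte = 0.5e12  # bytes, decimal units assumed

print(round(transfer_minutes(half_terabyte, 10), 1))  # 6.7
```

Any state on disk, or any sender or receiver that can't keep up with the link, pushes the number higher.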

Although, in my experience, we didn't tend to explicitly replicate disk-based storage for new nodes. All of our disk-based data was transient, so replacing nodes meant writing to new nodes, reading from new and old, and retiring the old nodes when their data had all been fetched and deleted, or the data retention cap was reached.
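As a toy illustration of that replacement pattern (not the actual system, whose details aren't given here), the sketch below uses plain dicts as stand-ins for nodes. The class and method names are hypothetical, and the retention cap is left out.

```python
class TransientStore:
    """Hypothetical view over old and new nodes during a replacement."""

    def __init__(self, old_nodes, new_nodes):
        self.old = list(old_nodes)  # nodes being retired
        self.new = list(new_nodes)  # replacement nodes

    def write(self, key, value):
        # New data only ever lands on the new nodes.
        node = self.new[hash(key) % len(self.new)]
        node[key] = value

    def fetch_and_delete(self, key):
        # Reads check the new nodes first, then the old ones.
        for node in self.new + self.old:
            if key in node:
                return node.pop(key)
        return None

    def retire_drained(self):
        # An old node can go away once all its data has been fetched.
        drained = [n for n in self.old if not n]
        self.old = [n for n in self.old if n]
        return len(drained)


store = TransientStore(old_nodes=[{"a": 1}], new_nodes=[{}])
store.write("b", 2)
still_busy = store.retire_drained()   # 0: the old node still holds "a"
value = store.fetch_and_delete("a")   # 1, served from the old node
retired = store.retire_drained()      # 1: the old node is now empty
```

No data is copied between nodes; the old ones simply empty out as clients consume what they hold.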

I/O volume meant networked filesystems would be a big stretch. You could probably do something with dual-ported SAS drives, and redundant pairs of machines on the same rack, but then that pair will both go down when that rack has an unforeseen problem, plus good luck getting dual-ported SAS drives hooked up properly when you're in someone else's bare metal managed hosting.

(Yeah OK, maybe we had big performance requirements, but hotloading works just as well for stuff that fits on a single redundant pair, or even a single server in a pinch)

Interesting case study, thanks for sharing!