by colmmacc 3212 days ago
Wow, that sounds awful - but thankfully this isn't typical. I'm going to go digging for a case-id/issue and see what's going on myself (please e-mail the case if you have one). Reconfigurations are normally routine and graceful.

From your description, it may be that you have long-lived connections that build up over time at a rate the targets can easily handle, but that the reconnect spikes associated with a target failure/withdrawal are too intense. This is a challenge I've seen with web sockets: imagine building up 100,000 mostly-idle web sockets slowly over time. Even a modest pair of backends can handle this. But then a backend fails, and 50,000 connections come storming in at once!
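To put rough numbers on that spike, here is a minimal sketch. It assumes connections are spread evenly across backends; the 24-hour build-up period and the 10-second reconnect window are hypothetical figures chosen for illustration, not values from the case being discussed.

```python
def reconnect_spike(total_connections, backends, failed=1):
    """Connections that must reconnect when `failed` backends drop out,
    assuming connections were spread evenly across all backends."""
    per_backend = total_connections // backends
    return per_backend * failed


def connection_rates(total_connections, backends,
                     buildup_seconds, reconnect_window_seconds):
    """Return (steady_rate, spike_rate) in new connections per second.

    buildup_seconds and reconnect_window_seconds are illustrative
    assumptions: how long the connections took to accumulate, and how
    quickly clients retry after a backend disappears.
    """
    steady_rate = total_connections / buildup_seconds
    spike_rate = (reconnect_spike(total_connections, backends)
                  / reconnect_window_seconds)
    return steady_rate, spike_rate


if __name__ == "__main__":
    steady, spike = connection_rates(
        total_connections=100_000,
        backends=2,
        buildup_seconds=24 * 60 * 60,   # assumed: built up over a day
        reconnect_window_seconds=10,    # assumed: clients retry within 10 s
    )
    print(f"steady: {steady:.2f} conn/s, spike: {spike:.0f} conn/s")
```

Under those assumed figures the steady build-up is a little over one new connection per second, while the surviving backend sees 5,000 per second during the reconnect window.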

Another scenario is adding an "idle" target to a busy workload, where the new target can't handle the rate of new connections it suddenly receives. Software that relies on caching (including things like internal object caches) can often handle a slow ramp-up, but not a sudden rush.

We're currently experimenting with algorithms that let customers ramp up the incoming rate of connections more slowly in these kinds of scenarios.
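The comment doesn't say which algorithms are being tried. As one possible illustration only, here is a sketch of a linear "slow start" weighting, where a newly added target's share of new connections grows with its age. The function names, the linear shape, and the 5% floor are all assumptions made for this example.

```python
def ramp_weight(age_seconds, ramp_seconds, full_weight=1.0, floor=0.05):
    """Weight for a target that joined `age_seconds` ago.

    Hypothetical linear ramp: the weight rises from `floor * full_weight`
    to `full_weight` over `ramp_seconds`, then stays at full weight.
    """
    if ramp_seconds <= 0 or age_seconds >= ramp_seconds:
        return full_weight
    fraction = max(age_seconds, 0.0) / ramp_seconds
    # The floor keeps a brand-new target from receiving no traffic at all.
    return max(floor * full_weight, fraction * full_weight)


def share_of_new_connections(weights):
    """Convert per-target weights into each target's fraction of new connections."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


if __name__ == "__main__":
    # One warm target plus one target added 30 s ago, with a 60 s ramp.
    weights = {"warm": 1.0, "new": ramp_weight(age_seconds=30, ramp_seconds=60)}
    print(share_of_new_connections(weights))
```

Halfway through the ramp in this example, the new target takes about a third of new connections instead of half, which gives its caches time to fill before it carries an equal share.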

Anyway, those are guesses, so I may be wrong about your case, but hopefully the information is still useful to others reading.