by dudul 2555 days ago
I would be curious to know what makes this feature a major selling point. Aren't rolling deploys easier to handle? I'm not saying it's not a great feature, but honestly, it has probably become much less useful since Erlang/OTP was originally designed.

Hot code upgrades are one of the Erlang features that help me keep headcount down on our operations team. Appcues is three years into running Elixir in prod, and we still have not had to set up blue/green deploys because the built-in upgrade and rollback feature is so robust. For the first year and a half of that, we had a single platform developer (hi!) supporting dozens of servers and hundreds of customers. Hot upgrades are very, very useful.
Yes, but they do come with a lot of cognitive load and sharp edges. For many apps they are simply not worth it.
The cognitive load is quite low when deploying changes which only involve your app code (i.e., no dependency upgrades). My app has a few tens of millions of open websockets at a time, and it's worth it to me to avoid mass reconnects. I'm not everyone, but my use case isn't totally unique either.
I am not sure about your specific use case, but it would be hard to imagine that graceful reconnect on the websockets plus a more traditional deploy would not be a much more effective use of resources than the hurdles you have to go through to properly appup. Is it really an advantage to you, or has it slowed your ability to get to market for something that seems like it "should" be an advantage?

It's not like you don't need graceful reconnect handling logic for the websockets anyways -- net-splits and endpoint failures are a thing.

It really depends on how much effort is involved in setting up the connections, and how much effort is involved in directing connections to a new host.

If I'm running 1M sessions per machine, and each machine can handle 10k new connections / second, rolling restart is really expensive. If I'm running 10k sessions per machine, and can handle 10k new connections / second, rolling restart isn't too bad. This generalizes to really anything that gathers essentially ephemeral state, where that state is costly to gather (TCP flows in this case, data caches in others, etc).
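
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes the accept rate is the only bottleneck and that every session reconnects to the machine that was restarted, which is a simplification. The module and function names are made up for illustration.

```erlang
-module(restart_cost).
-export([reconnect_seconds/2]).

%% Seconds one machine needs to re-accept all of its sessions after a
%% restart, assuming new-connection throughput is the only limit.
%% Both arguments are illustrative inputs, not measured values.
-spec reconnect_seconds(non_neg_integer(), pos_integer()) -> float().
reconnect_seconds(Sessions, AcceptsPerSecond) when AcceptsPerSecond > 0 ->
    Sessions / AcceptsPerSecond.
```

With the figures above, restart_cost:reconnect_seconds(1000000, 10000) comes out to 100.0 seconds per machine, against 1.0 second for 10k sessions. Multiply that by the number of machines in a rolling restart and the gap gets large quickly.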

BEAM excels at applications with huge numbers of sessions per machine, which is why some users really value hot reloading.

Edited to add -- the OTP application update sequence doesn't necessarily need to be used. Where I work, we certainly don't do that. Just a little bit of logic around code:soft_purge/1 and code:load_file/1.
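
For readers who haven't seen those two calls, a minimal sketch of what such logic could look like follows. This is not the commenter's actual code: the module name hot_reload and the error atom old_code_in_use are assumptions for illustration, and a real setup would also need to get the new .beam files onto the code path and decide what to do with processes still running old code.

```erlang
-module(hot_reload).
-export([reload/1]).

%% Reload a single module without going through an OTP release upgrade.
-spec reload(module()) -> {ok, module()} | {error, term()}.
reload(Module) ->
    %% soft_purge/1 removes the old version of the module only if no
    %% process is still running it. It returns false otherwise and,
    %% unlike code:purge/1, never kills a process.
    case code:soft_purge(Module) of
        true ->
            %% load_file/1 loads the object code found on the code path
            %% and makes it the current version of the module.
            case code:load_file(Module) of
                {module, Module} -> {ok, Module};
                {error, Reason} -> {error, Reason}
            end;
        false ->
            %% Hypothetical error atom: some process still runs old code.
            {error, old_code_in_use}
    end.
```

Purging softly first matters because loading a module while an old version is still around fails. Returning an error instead of force-purging leaves it to the caller to retry later or to accept killing the lingering processes.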

"Few tens of millions of open websockets" is more than Slack and Discord combined, not?

BTW: we were the first to run a load test with 1M long-lived connections on Cowboy on a single EC2 instance back in 2011, and with 3M long-polling HTTP requests on a single beefy physical server.

It was before WhatsApp upstreamed their optimizations and before the Phoenix team made it easy.

I imagine both Slack and Discord have much higher numbers than this, as well as a very different workload (they're doing chat, we're not).

I don't want to give the impression that this is happening on a single server, either! I did the cowardly thing and threw more boxes at the problem. :)

It's a difference in run time purpose.

With a single code base capable of having millions of processes running as the norm, some handling direct client requests and others handling in-progress work, data storage, holding open connections for transfers, etc...you get the capability to deploy without disrupting ANY of that.

Most run times can't do anything close to that. Think about all of those X million websocket benchmarks...now think about being able to deploy without forcing all X million to try to reconnect at the same time.

And it can do this while all of the nodes are connected and communicating with each other as well as the outside world.

For standard issue client server, it's not that big of a deal. You just separate the web parts behind a load balancer.

For background workers, long lived connections, web sockets, video/audio streams, file transfers (CDN)...it's huge.

For all that to work you will need:

- to understand exactly what your app is doing

- to understand exactly what Erlang/Elixir releases are doing

- to understand exactly how code upgrades work

- to understand exactly or very damn well how to make the system handle those 1 million connections

- to understand exactly how to handle all the things you wrote about

And then, and only then will you be able to "think about being able to deploy without forcing all X million to try to reconnect at the same time."

There are no magic bullets.

Eh...it’s basically a 1 line command with distillery. Another for the rollback capability.

It’s pretty magical. There’s a reason people love it.

Certainly, don’t use it if you don’t need it...it introduces extra complexity...but if you do it’s really hard to beat.

Having to support two different versions of the Elixir service during the rollout period is risky...

What if something goes wrong during the rollout? For example, if you change the database schema, upgrade your database engine, or change your back-end authentication approach, it can break the old code. Then how will you know whether it's a problem with the old code or the new code if many nodes are running both?

Those are things you have to account for in any zero downtime deploy situation though. It’s mostly minor changes in how you roll out schema changes.
I was mostly listing things you need for Erlang. Elixir, thankfully, hides a lot of things away in much friendlier packages.

That said, even with Distillery, if something goes awry, you'll need to know the warts behind the magic :)