Wait, what? I wrote an Elixir/Phoenix webapp around 2015 and one of the major selling points was the ability to hot-upgrade code that was inherited from Erlang. What changed?
I compare it to unwrapping a handful of utility razors, chucking them up in the air, and saying "free razors!".
You CAN do hot upgrades, but many times the complexity of doing so far outweighs the benefits. For any non-trivial app it makes updates/deploys to the app non-trivial as well.
The problem with hot code updates is that they can break your app if your structs change shape, and they are extremely hard to get right over a period of time. Best to do rolling deployments with Kubernetes or something...
You make it sound like setting up k8s is a walk in the park! I've used hot code upgrades in production for 3+ years, and we've had it blow up in our face perhaps twice. If you do canary deploys, it's not hard to control the blast radius.
Hot code updates can be challenging at times, but the decrease in deployment time and hassle for changes where it makes sense is definitely worth it, and using hot code loading for most updates doesn't prevent you from using a rolling restart for the updates where it's more appropriate.
If you're using distribution, there's a good chance you need to deal with messages from both updated and not-yet-updated nodes anyway, so handling different shapes from different versions is a skill you already need to have; it's just that hot loading exposes it at more layers than before.
Since working in Erlang for several years, when I have to work in languages where hot loading is not easy or common, it's always frustrating. Throwing away connection-attached state to update code is a hard choice that doesn't need to be made.
> The problem with hot code updates is they can break your app if your structs change shape
I have no experience with Erlang/Elixir myself, so please excuse me if this is a silly question – but, why can't the language/platform actually detect the difference in struct shape, and refuse to deploy the new version in that case? Or, demand that you provide some sort of conversion function that maps the old struct to the new one? Is this a consequence of lack of static typing?
What it doesn't do is give you a hook to relup the messages that get passed between processes. If you do it right, you can pattern match against different versions of messages and apply the correct conversion function, but there are liable to be many, many shapes of messages (including ones that you might not write yourself, e.g. coming from a library) and it might be difficult to catch them all.
To do it right, I would want to set up a lot of testing to make sure you can hot code reload safely. There is no specific guideline, and what probably hurts progress in developing guidelines is that it's generally okay to have some downtime on any individual server node: Erlang/Elixir encourages you to think about failover anyway, and most Elixir apps are relatively stateless webservers, so you've probably got a robust load balancing and migration scheme in place in your cluster to begin with, making blue/green or rolling updates a "pretty much good enough" thing.
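A rough sketch of the "pattern match against different versions" idea, written in Python rather than Elixir so it stands on its own. The message shapes, the "lobby" default, and the normalize name are all made up for illustration; a real app would have many more shapes, and the final raise is exactly the "difficult to catch them all" case.

```python
# Hypothetical example: two versions of the same message coexist during an upgrade.
#   v1 shape: ("user_joined", user_id)
#   v2 shape: ("user_joined", user_id, room)
# The receiver converts every known shape to the current one before handling it.

def normalize(msg: tuple) -> dict:
    """Convert any known message shape into the current (v2) representation."""
    if len(msg) == 2 and msg[0] == "user_joined":
        # Old senders do not know about rooms; assume a default (an assumption
        # made for this sketch, not something the thread specifies).
        _, user_id = msg
        return {"type": "user_joined", "user_id": user_id, "room": "lobby"}
    if len(msg) == 3 and msg[0] == "user_joined":
        _, user_id, room = msg
        return {"type": "user_joined", "user_id": user_id, "room": room}
    # Any shape nobody anticipated (e.g. one produced by a library) ends up here.
    raise ValueError(f"unrecognized message shape: {msg!r}")


if __name__ == "__main__":
    print(normalize(("user_joined", 7)))
    print(normalize(("user_joined", 7, "general")))
```

The conversion itself is the easy part; the hard part is knowing that the list of shapes is complete, which is why the testing burden grows with every release.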
It was talked about a lot by people who were fans of elixir/erlang but hadn't really done much work with it (these folks are increasingly common in tech, confusingly so sometimes but good).
Digging deeper, the line from folks using Erlang/Elixir "in anger" was always that it was a supported feature, but in reality most people shouldn't do it and wouldn't need it enough to justify the hassle.
Nothing changed; hot code reloading is still possible through the mechanisms that have always existed. What's being stated here is that the new Elixir support for "releases" does not include support for this, so you would have to use a separate mechanism to perform the hot code reload.
That "separate mechanism" could be as simple as a plug that tells Mix to recompile and reload. Example (designed for - and part of - the Sugar framework, but should theoretically work for any Plug-based app, including Phoenix-based ones): https://github.com/sugar-framework/plugs/blob/master/lib/sug...
Hot upgrades like this are not foolproof (which is likely the reason why the above example is gated to :dev environments); there are other concerns like database migrations and other internal and external variances that make this inappropriate for most production situations. That said, these same concerns often exist for other high-availability situations as well, so if you know that you want zero-downtime code upgrades, figuring out a way to do it cleanly within the application is likely valuable as a way to avoid the hell on Earth that is trying to do this with, say, a bunch of load-balanced Docker containers.
As a bit of a correction here to my own comment, it looks like Elixir releases don't ship with Mix, so the above example probably wouldn't work in that particular scenario (unless maybe you figure out some way to include Mix).
So the better option would be to figure out a way to compile a new release, get the modules in place where the currently-running release expects them, then write up a plug similar to the above to check for new module versions and load them (or just do it as part of whatever mechanism you'd use to get the updated modules into the running release in the first place).
I would be curious to know what makes this feature a major selling point. Aren't rolling deploys easier to handle? I'm not saying it's not a great feature, but honestly, it has probably become much less useful since Erlang/OTP was originally designed.
Hot code upgrades are one of the Erlang features that helps me keep headcount down on our operations team. Appcues is three years into running Elixir in prod, and we still have not had to set up blue/green deploys because the built-in upgrade and rollback feature is so robust. For the first year and a half of that, we had a single platform developer (hi!) supporting dozens of servers and hundreds of customers. Hot upgrades are very, very useful.
The cognitive load is quite low when deploying changes which only involve your app code (i.e., no dependency upgrades). My app has a few tens of millions of open websockets at a time, and it's worth it to me to avoid mass reconnects. I'm not everyone, but my use case isn't totally unique either.
I am not sure about your specific use case, but it would be hard to imagine that graceful reconnect on websockets plus a more traditional deploy would not be a much more effective use of resources than the hurdles you have to go through to properly appup. Is it really an advantage to you, or has it slowed your ability to get to market for something that seems like it "should" be an advantage?
It's not like you don't need graceful reconnect handling logic for the websockets anyway -- net-splits and endpoint failures are a thing.
It really depends on how much effort is involved in setting up the connections, and how much effort is involved in directing connections to a new host.
If I'm running 1M sessions per machine, and each machine can handle 10k new connections / second, rolling restart is really expensive. If I'm running 10k sessions per machine, and can handle 10k new connections / second, rolling restart isn't too bad. This generalizes to really anything that gathers essentially ephemeral state, where that state is costly to gather (tcp flows in this case, data caches in others, etc).
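To put numbers on that, here is a back-of-the-envelope calculation in Python. It assumes the connection accept rate is the only bottleneck and that the restarted machine accepts at full rate the whole time, so it is a lower bound; the function name is just for illustration.

```python
def reconnect_seconds(sessions_per_machine: int, new_conns_per_second: int) -> float:
    """Lower bound on the time to re-establish every session on one restarted machine."""
    return sessions_per_machine / new_conns_per_second


if __name__ == "__main__":
    heavy = reconnect_seconds(1_000_000, 10_000)  # 1M sessions at 10k new connections/s
    light = reconnect_seconds(10_000, 10_000)     # 10k sessions at 10k new connections/s
    print(f"1M sessions:  at least {heavy:.0f} s per machine")
    print(f"10k sessions: at least {light:.0f} s per machine")
```

So the first case is at least 100 seconds of reconnect churn per machine versus about 1 second in the second, before counting whatever it costs to rebuild the per-connection state.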
BEAM excels at applications with huge numbers of sessions per machine, which is why some users really value hot reloading.
Edited to add -- the OTP application update sequence doesn't necessarily need to be used. Where I work, we certainly don't do that. Just a little bit of logic around code:soft_purge/1 and code:load_file/1.
"Few tens of millions of open websockets" is more than Slack and Discord combined, no?
BTW: we were the first to run a load test with 1M long-lived connections with Cowboy on a single EC2 instance, back in 2011. And 3M long-polling HTTP requests on a single beefy physical server.
It was before WhatsApp upstreamed their optimizations and before the Phoenix team made it easy.
I imagine both Slack and Discord have much higher numbers than this, as well as a very different workload (they're doing chat, we're not).
I don't want to give the impression that this is happening on a single server, either! I did the cowardly thing and threw more boxes at the problem. :)
With a single code base capable of having millions of processes running as the norm, some handling direct client requests and others handling in-progress work, data storage, holding open connections for transfers, etc...you get the capability to deploy without disrupting ANY of that.
Most runtimes can't do anything close to that. Think about all of those X million websocket benchmarks...now think about being able to deploy without forcing all X million to try to reconnect at the same time.
And it can do this while all of the nodes are connected and communicating with each other as well as the outside world.
For standard issue client server, it's not that big of a deal. You just separate the web parts behind a load balancer.
For background workers, long lived connections, web sockets, video/audio streams, file transfers (CDN)...it's huge.
Having to support two different versions of the Elixir service during the rollout period is risky...
What if something goes wrong during the rollout? For example, if you change the database schema, upgrade your database engine, or change your back-end authentication approach, it can break the old code. Then how will you know whether it's a problem with the old code or the new code if the cluster is running both?