by teraflop 3173 days ago
I don't want to knock the technical achievement here -- it's a cool hack -- but I'm really surprised that it was deemed to be the best choice for a production system.

In the first place, "we can't compile our code on every change because it takes too long" is a really awful situation to be in. Are developers not building and testing their changes before deploying them? Can Facebook not afford a continuous integration system that can run builds in parallel? It sounds like this problem is only happening because the application is a giant monolith, but for some reason splitting it up would slow down development even more... I'm not sure I buy that reasoning.

The article says that "Haskell’s strict type system means we’re able to confidently push new code knowing that we can’t crash the server", which is a real stretch. In addition to all of the usual ways a computation can diverge, this hot-swapping system adds a whole new variety of failure modes. The article talks about how the code needs to be carefully audited to prevent memory leaks, but it doesn't even mention the weird things that can happen when mutable state is preserved across code modifications. Debugging is a pain when your data structures can get into states that aren't reachable with any single version of the code. (This is a well-known issue in Linux kernel live-patching, for instance.)
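To make the "states that aren't reachable with any single version of the code" point concrete, here is a minimal, language-agnostic sketch in Python. It is not the article's system: `handle_v1`, `handle_v2`, and `run` are hypothetical names, and the only assumption is a long-lived process whose state survives while its handler is replaced.

```python
# Hypothetical sketch: `state` outlives the handler code that writes to it.

def handle_v1(state, user):
    # v1 invariant: every user maps to an integer count.
    state[user] = state.get(user, 0) + 1


def handle_v2(state, user):
    # v2 invariant: every user maps to a list of event names.
    state.setdefault(user, []).append("login")


def run(events_before_swap, events_after_swap):
    state = {}                  # preserved across the swap
    handler = handle_v1
    for user in events_before_swap:
        handler(state, user)
    handler = handle_v2         # hot swap: the code changes, the state does not
    for user in events_after_swap:
        handler(state, user)
    return state
```

Calling `run(["alice"], ["bob"])` leaves a dictionary that mixes an integer and a list, which neither version would produce on its own. Calling `run(["alice"], ["alice"])` fails outright, because v2 tries to append to the integer that v1 left behind.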

8 comments

I should have emphasized the speed of deployment being a first order concern more. We certainly can (and do) build our code for every change, but not at the speed that we want to be updating.

We use a monorepo for all of the benefits it has, and deploying fast business logic updates this way helps mitigate one of its downsides (particularly when you've maximally parallelized the build). I've found https://danluu.com/monorepo/ to give a quick overview of how chopping up the repo would have separate downsides.

The section about "Sticky Shared Objects" speaks directly to mutable state across code modifications, just with a Haskell-minded focus.

How much is this because of Haskell's build times in particular? Is there a sort of "target build time" that would make you more comfortable with this stuff?
I don't think coming across these problems in general is Haskell-specific. We've grown enough to bubble these issues up in this Haskell project, but would have needed to do something much sooner if this were C++.

> make you more comfortable with this stuff

Which stuff are you referring to? Overall I'd love it if all builds were significantly faster, so we contribute to upstream GHC to make it better in the ways we come across. Our platform has a deployment SLA that we strive to maintain as our "target build time".

It kind of sounds like they're running into some limitations of GHC: it tends to take a long time to compile stuff, and it tends to generate some very big binaries. For most applications, those aren't major problems, but in their use case (hundreds of thousands of lines of code deployed to many servers) it is an issue, so they're working around it. That allows them to keep working in the language they prefer and are productive in, which is great.

Improving GHC compile times and reducing the binary size would be better, but presumably a lot of work has already gone into those problems and if it were easy someone would have done it by now. As for myself, I really like using Haskell and I'm glad whenever I hear about it being used in industry.

While I agree a slow build indicates a problem with their build infra, Haskell’s purity and type system do rule out issues with mutable state (presumably there isn’t any in the hot-swapped module) and invalid states (the type system prevents invalid states from being constructed, given the way they have a fixed hot-cold API).

The article describes the hot-swapped module as containing frequently changing business logic, which sounds like it’s something they can probably do via an interface with well-constrained or no mutability.

I remember at Standard Chartered they have a Haskell monolith project of a few million LoC and they are relying on incremental building. [1]

I wonder why that wasn't an option for Facebook.

[1] From podcast: http://www.haskellcast.com/episode/002-don-stewart-on-real-w...

I'm not 100% sure, but they (Standard Chartered) do use a custom compiler. That might explain the difference.
I think the main benefit is the middle point. It sounds like they have programs with huge memory footprints and (I’m guessing) caches that take a while to warm up. This lets them avoid that. Fraud detection is probably time sensitive and slow responses aren’t acceptable.
They could transfer the cache data from one (old) server instance to another (new) one.
I agree. Fun read and cool hack, but it definitely feels like they are stretching to justify the more fun of the two options (spend time on this or spend time fixing the root cause).
Glad you guys both can make a better trade off than the engineers that actually have their hands on the problem. /s

You're reading a blog post, you do not know all they have tried, nor the various intricacies they're dealing with.

Yeah, my initial reaction was "I can see how these design decisions might make sense, but the blog post is horrible."

These kinds of designs typically emerge over a long and winding history and, for someone who was part of that process, it's difficult to coherently describe the final state to an outsider. Good textbook authors have this skill. Most tech blog authors do not. (I think that part of the problem is that people don't respect just how difficult it actually is.)

My guess: restarting a large fleet of processes is a pain. The rollout will typically be throttled to avoid connection churn, among other things. For risky code changes, you probably want a slow rollout anyway, but if you're just tweaking abuse detection rules (almost just a config change), it's nice to have your changes take effect more quickly. Dynamic loading seems like one reasonable way to achieve that goal.
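As a rough illustration of that guess (and only a guess, like the paragraph above), here is a minimal Python sketch of swapping rule code inside a running process while keeping warm state. Everything in it is hypothetical: `load_rules`, `Server`, the `is_abusive` rule, and the `posts_per_minute` field are made up for the example and say nothing about how the real system works.

```python
import types

# Two hypothetical versions of the rule code, shipped as source text.
RULES_V1 = """
def is_abusive(request, cache):
    return request["posts_per_minute"] > 100
"""

RULES_V2 = """
def is_abusive(request, cache):
    return request["posts_per_minute"] > 20
"""


def load_rules(source, name="rules"):
    """Execute rule source inside a fresh module object and return it."""
    module = types.ModuleType(name)
    exec(compile(source, name, "exec"), module.__dict__)
    return module


class Server:
    def __init__(self, rules_source):
        self.cache = {}  # stands in for expensive warm state; kept across swaps
        self.rules = load_rules(rules_source)

    def swap_rules(self, rules_source):
        # If loading raises, the assignment never happens and the old rules stay.
        self.rules = load_rules(rules_source)

    def is_abusive(self, request):
        return self.rules.is_abusive(request, self.cache)
```

A `Server(RULES_V1)` can later call `swap_rules(RULES_V2)` and immediately apply the stricter threshold without restarting or discarding `cache`. The trade-off is the one raised earlier in the thread: whatever is in `cache` was written by the old code.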

Tangent: people, please stop making analogies to mechanical engineering feats that are WAY more difficult than what you did [1]. People have been loading shared libraries forever; it's like adding an AUX port, not swapping out the engine. It's not even in the same league as Ksplice or as the JVM's dynamic loading/deoptimization.

[1] http://jensimmons.com/post/jan-4-2017/replacing-jet-engine-w...

You're right, I don't know all the intricacies of their system. That's why I said "I'm surprised" rather than "this is a bad design decision". It doesn't mean I can't point out potential pitfalls that I think the blog post glosses over.
They explained their justification; if they don't want random people on random forums disagreeing with their justification because it wasn't complete enough, they are free to make it more complete.
> In the first place, "we can't compile our code on every change because it takes too long" is a really awful situation to be in.

Isn't this exactly the problem Go was invented to solve?

It was one of them. However, given the other writing/talks Facebook has put out about their usage of Haskell and Haxl, Go is probably not a good fit for their use case due to language expressivity concerns (not declarative enough, not enough type safety, not syntactically flexible enough for writing DSLs).
It "solves" it by not doing any of the things that you'd expect a modern language's compiler to do.

In my opinion the time wasted debugging Go issues that could have been statically prevented is better spent waiting for a slightly longer compile cycle to finish.

No, because fast compile times were already a solved problem in the 1970s for people using languages like CLU, UCSD Pascal, Mesa and Modula-2.

The authors may have been motivated by waiting on C++ builds, but slow compilation was not a problem for people using other programming languages.

JVM hotswapping is ages old, but it's usually used only in testing, not production.