Well, there's the problem right there. They're using a dyno manifold to connect to the routing mesh. If they'd just use a flux capacitor, they could use a static manifold instead.
Love it. Times like this I wish there were more of a sketch comedy scene in our community... Not sure whether to go with "The AWS Enterprise" or "Bob the Cloud Mechanic".
This is classic geek owning up. My first thought was that this was written with two purposes:
1) to prevent the average, non-technical person from understanding it ("Phew, I'm glad these guys are figuring out this stuff and not me - that's why I host with them. I don't even know what a 'dyno manifold' is!")
2) to show management how smart we are and that you still need us ("because who else is going to figure this 'routing mesh' stuff out if you fire those responsible for the outage")
A simple "we're sorry and we've given 10 lashes to the engineer performing the manual garbage collection" may have been a better approach.
Having said that, I still think Heroku is awesome.
Read up on the Heroku architecture. These are the terms it uses.
The manual garbage collection wasn't the problem. An unexpected data structure created by garbage collection wasn't handled in a fault tolerant manner.
Instead of "dyno", they could possibly use a word like "VM". Except that they're not really virtual machines, nor are they EC2 instances. Read Only Chroot Jails plus Precompiled Application, Libraries, and Environment (ROCJPALEs?) They also have a pretty complex set of support structures that provide connectivity to databases and other resources. Perhaps someone can suggest an existing name for that, but I know of none.
Instead of "manifold", perhaps they could use the word "cluster". Except it's not really a cluster, it's a set of distributed clusters. And nodes in a cluster are typically machines. The nodes in the dyno manifold aren't machines or virtual machines; they're ROCJPALEs. You could use the word "array", but again, it's not really an array. It's a multi-layered, geographically distributed structure of co-hosted application jails. "Manifold" seems as good a term as any.
"Streaming" seems like a good word. It's specifically relevant to this incident... they describe how the API is not atomic; that each message is built on top of the previous entries, and the data structures are implicit in the stream. That sounds like the definition of "streaming" to me.
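To make the "not atomic" point concrete, here is a rough sketch of a consumer rebuilding state from a stream of incremental messages. The message format and the `replay` function are made up for illustration; nothing here is Heroku's actual API. The point is only that the data structure exists implicitly in the stream, so a consumer that misses one entry ends up with a different structure.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical message format: (operation, key, value).
Message = Tuple[str, str, int]

def replay(stream: Iterable[Message]) -> Dict[str, int]:
    """Rebuild state by applying each message on top of the previous ones."""
    state: Dict[str, int] = {}
    for op, key, value in stream:
        if op == "set":
            state[key] = value
        elif op == "incr":
            # Meaning depends on whatever earlier messages left in state.
            state[key] = state.get(key, 0) + value
        elif op == "del":
            state.pop(key, None)
        else:
            # Assumption: tolerate unknown operations by skipping them
            # instead of crashing the consumer.
            continue
    return state

stream = [("set", "dynos", 2), ("incr", "dynos", 3), ("incr", "web", 1)]

full = replay(stream)         # saw every message
partial = replay(stream[1:])  # missed the first message

print(full)     # {'dynos': 5, 'web': 1}
print(partial)  # {'dynos': 3, 'web': 1}
```

No single message describes the whole state, which is why an unexpected entry partway through the stream can leave every downstream consumer with a structure nobody planned for.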
"API" seems like a widely accepted term. They could've described it as a "protocol", perhaps. But neither seems more jargony than the other.
"Data"... well I suppose "streaming API" without the data would work. But it serves to differentiate it from a streaming video protocol.
"Mesh" has a very specific meaning. It means that you have a set of nodes that are connected peer-to-peer and that messages travel through the network by hopping from node to node. I'm assuming that their routing layer is organized in this way.
"Routing" is also pretty well defined. Requests come in and need to be sent to the machine that can serve responses to it. What would you call that instead of routing?
I feel like people who object to this kind of language are the same folks who object to the word "cloud". People don't take the time to understand the different strategies for provisioning and application hosting APIs, and then think these words don't mean anything. Yeah, salespeople use the word to hustle the Same Old Shit, but it also means something to people like us who are actually building stuff.
Man, that's a long and contrived justification for what amounts to a pile of bullshit.
We have seen very elaborate post-mortems from Google, Facebook, Twitter, and not least from Amazon themselves (you know, the playground that Heroku builds their sandcastles in).
The aforementioned companies had no problem explaining their respective issues in plain language that every engineer could understand.
Heroku doesn't even try to explain themselves. They just throw around fantasy words without real explanations, seemingly overwhelmed by their own awesomeness (in a failure report, no less).
As an engineer I feel insulted by this pamphlet. All I can gather from it is that they screwed up and that it was apparently somehow related to their request-routing layer. Thanks, we knew as much before reading that text.
I still have no idea what actually went wrong and how they intend to prevent it in the future. But I'll certainly advise people to avoid a company that babbles about "control rods" when their software screws up.
If I mechanically replace the words "routing mesh" with "load balancer", I instantly know what they're talking about without losing out on any important details.
Other than the fact that a load balancer is generally a monolithic piece of hardware. The failure modes are well defined, but most of them result in catastrophic outages.
I'm going to assume their routing mesh has many points of ingress and a larger number of exit paths (the dyno manifold), but that the nodes they've got participating in the mesh are actually in some sort of mesh topology (or form a connected graph).
This has the upside that if you lose several nodes in the mesh you probably haven't lost a path to any dynos. If you lose a whole AZ you can spin up new dynos in one of the remaining ones and reconfigure the mesh quickly. My experience with load balancers, especially big load balancers, is that updating a large swath of VIPs is NOT a fast operation (although you would start failing health checks on the missing nodes pretty quickly, adding new capacity to replace them is hard).
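Here is a toy version of what I'm picturing, with a completely made-up topology (the node names and the `reachable` helper are mine, not anything Heroku has published). It just checks whether every dyno is still reachable from an ingress point after some routing nodes go down.

```python
from collections import deque

def reachable(graph, start, down=frozenset()):
    """Return the set of nodes reachable from start, skipping failed nodes."""
    if start in down:
        return set()
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in down and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Hypothetical mesh: one ingress, three routing nodes, two dynos.
mesh = {
    "ingress": ["r1", "r2"],
    "r1": ["r2", "r3", "dyno_a"],
    "r2": ["r1", "r3", "dyno_b"],
    "r3": ["r1", "r2", "dyno_a", "dyno_b"],
}
dynos = {"dyno_a", "dyno_b"}

# Losing one routing node still leaves a path to every dyno.
print(dynos <= reachable(mesh, "ingress", down={"r1"}))        # True
# Losing both nodes the ingress talks to cuts everything off.
print(dynos <= reachable(mesh, "ingress", down={"r1", "r2"}))  # False
```

The second case is the flip side: redundancy inside the mesh doesn't help if every path from a given ingress runs through the nodes that failed.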
The mesh has the downside that the failure modes are a lot more complicated. Oh, and nobody knows what the hell you're talking about.
Of course, I could be wrong. They could just be using NetScalers (or ELB) and calling it a "routing mesh".
Oh, and nobody knows what the hell you're talking about.
Yes. It's a level of detail that borders on obfuscation.
fwiw, I've always used the term "load balancer" to also refer to two redundant load balancing machines. (If I worked with more complex load balancers, I doubt I'd stop.) In the general sense, it just means "the apparatus that balances the load".
This seems to be an unfortunate attempt to apply Corporate Speak to a technical announcement; "Let's see how many paragraphs we can fill with technical-sounding gibberish without actually telling anyone anything..."
How about words like "EC2 instances", "Erlang processes", "HAProxy", "nginx" and similar stuff that was likely involved in the incident?
If they're too embarrassed to tell what happened then they should just keep quiet. Don't insult your customers with handwavy bullshit bingo; that just leaves a sour taste in everyone's mouth.
Just imagine the hilarity when the PHB asks his in-house engineer to translate this "post-mortem" into layman's terms for him. Most bosses have a bit of humor, but not when it comes to hosting infrastructure.
Heroku had to coin some words of their own to "mask" the fact that their services are but engineering on top of the AWS stack (which isn't to belittle the effort involved).