Hacker News
Heroku: a follow up on last week's outage (status.heroku.com)
81 points by twampss 5125 days ago
9 comments

Hey guys I got this, I speak cloudonaut. Here I'll translate it to sysadmin:

An admin was doing a rolling restart that triggered a bug in the load balancer software. The auto restart script turned out to just make things worse by restarting it over and over (they always do), so we thought we'd just quickly throw spare capacity at it, but turns out that never works in a panicked rush either. Also, our system designed to handle outage notifications wasn't capacity planned, like, at all.

I know this is a joke, but from the sounds of the errors that isn't far from true. This is also basically what happened with the big two-day outage at Amazon a while back. It's always the automated processes that come back to bite you it seems.

I know I have had my share of server issues, but it seems to me that many 'cloud' services out there are simply adding too many layers of abstraction that tend to make things very, very touchy to any small issue occurring. Because of this I try to keep my server stacks/frameworks as basic as possible while still implementing performance oriented services like NoSQL, caching, etc.

Although I have my fair share of hesitance at worshiping cloud services, the fact that a service is "cloud" has nothing to do with the quality of its architecture. You can make a crucial architecture mistake designing a fleet of dedicated PHP servers talking to a MySQL cluster just as easily as you can building atop some cloud service.
I love Heroku, but am I the only one that thinks their choice of words describing their architecture is a bit pretentious?

"...streaming data API which connects the dyno manifold to the routing mesh."

Give me a break!

Well, there's the problem right there. They're using a dyno manifold to connect to the routing mesh. If they'd just use a flux capacitor, they could use a static manifold instead.
Love it. Times like this I wish there were more of a sketch comedy scene in our community... Not sure whether to go with "The AWS Enterprise" or "Bob the Cloud Mechanic".
Reminds me of the good ol' Turbo Encabulator...
This is classical geek owning up. My first thought was, this is written with two purposes:

1) to prevent the average, non-technical person from understanding it ("Phew, I'm glad these guys are figuring out this stuff and not me - that's why I host with them. I don't even know what a 'dyno manifold' is!")

2) to show management how smart we are and that you still need us ("because who else is going to figure this 'routing mesh' stuff out if you fire those responsible for the outage")

A simple "we're sorry and we've given 10 lashes to the engineer performing the manual garbage collection" may have been a better approach.

Having said that, I still think Heroku is awesome.

Read up on the heroku architecture. These are the terms used.

The manual garbage collection wasn't the problem. An unexpected data structure created by garbage collection wasn't handled in a fault tolerant manner.
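
To make "handled in a fault tolerant manner" concrete, here is a minimal Python sketch. Everything in it is hypothetical: the JSON record format, `apply_record`, and `consume` are made up for illustration and are not Heroku's actual protocol or code. It only shows the general idea of setting aside a record the consumer can't handle instead of letting it take down the process that reads the stream.

```python
import json

def apply_record(state, record):
    """Apply one stream record to a routing table (hypothetical format)."""
    op, app, dyno = record["op"], record["app"], record["dyno"]
    if op == "add":
        state.setdefault(app, set()).add(dyno)
    elif op == "remove":
        state.get(app, set()).discard(dyno)
    else:
        raise ValueError(f"unknown op: {op!r}")

def consume(lines):
    """Build state from a stream, quarantining records that can't be handled."""
    state, quarantined = {}, []
    for line in lines:
        try:
            apply_record(state, json.loads(line))
        except (ValueError, KeyError, TypeError) as exc:
            # Keep the bad record for later inspection and carry on.
            quarantined.append((line, repr(exc)))
    return state, quarantined

stream = [
    '{"op": "add", "app": "blog", "dyno": "web.1"}',
    '{"op": "gc"}',                # unexpected record: missing fields
    'not json at all',             # garbage on the wire
    '{"op": "add", "app": "blog", "dyno": "web.2"}',
]
state, quarantined = consume(stream)
```

After this runs, `state` still holds both `blog` dynos and `quarantined` holds the two records that could not be applied. Whether skipping a record is safe depends on the stream: if later records build on earlier ones, a skipped record may mean the consumer needs a full resync rather than just a log entry.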

All the reading in the world will not change the fact they pick retarded names.

Routing mesh? We call that a cluster of load balancers in the real world.

I can guarantee you that the Heroku architecture uses an internal slang for common sysadmin concepts.
OK, I'll bite.

Instead of "dyno", they could possibly use a word like "VM". Except that they're not really virtual machines, nor are they EC2 instances. Read Only Chroot Jails plus Precompiled Application, Libraries, and Environment (ROCJPALEs?) They also have a pretty complex set of support structures that provide connectivity to databases and other resources. Perhaps someone can suggest an existing name for that, but I know of none.

Instead of "manifold", perhaps they could use the word "cluster". Except it's not really a cluster, it's a set of distributed clusters. And nodes in a cluster are typically machines. The nodes in the dyno manifold aren't machines or virtual machines; they're ROCJPALEs. You could use the word "array", but again, it's not really an array. It's a multi-layered, geographically distributed structure of co-hosted application jails. "Manifold" seems as good a term as any.

"Streaming" seems like a good word. It's specifically relevant to this incident... they describe how the API is not atomic; that each message is built on top of the previous entries, and the data structures are implicit in the stream. That sounds like the definition of "streaming" to me.

"API" seems like a widely accepted term. They could've described it as a "protocol", perhaps. But neither seems more jargony than the other.

"Data"... well I suppose "streaming API" without the data would work. But it serves to differentiate it from a streaming video protocol.

"Mesh" has a very specific meaning. It means that you have a set of nodes that are connected peer-to-peer and that messages travel through the network by hopping from node to node. I'm assuming that their routing layer is organized in this way.

"Routing" is also pretty well defined. Requests come in and need to be sent to the machine that can serve responses to it. What would you call that instead of routing?

I feel like people who object to this kind of language are the same folks who object to the word "cloud". People don't take the time to understand different strategies to provisioning and application hosting APIs, and then think these words don't mean anything. Yeah, salespeople use the word to hustle the Same Old Shit, but it also actually means something to people like us who are actually building stuff.

Man, that's a long and contrived justification for what amounts to a pile of bullshit.

We have seen very elaborate post-mortems from Google, Facebook, Twitter, and not least from Amazon themselves (you know, the playground that Heroku builds their sandcastles in).

The aforementioned companies had no problem explaining their respective issues in plain language that every engineer could understand.

Heroku doesn't even try to explain themselves. They just throw around fantasy words without real explanations, seemingly overwhelmed by their own awesomeness (in a failure report, no less).

As an engineer I feel insulted by this pamphlet. All I can gather from it is that they screwed up, and that it was apparently somehow related to their request-routing layer. Thanks, we knew as much before reading that text.

I still have no idea what actually went wrong and how they intend to prevent it in the future. But I'll certainly advise people to avoid a company that babbles about "control rods" when their software screws up.

Are you a Heroku customer? I am, and I understand everything they said, and I appreciate that they went into detail about what happened.
If I mechanically replace the words "routing mesh" with "load balancer", I instantly know what they're talking about without losing out on any important details.
Other than the fact that a load balancer is generally a monolithic piece of hardware. The failure modes are well defined, but most of them result in catastrophic outages.

I'm going to assume their routing mesh has many points of ingress and a larger number of exit paths (the dyno manifold), but that the nodes they've got participating in the mesh are actually in some sort of mesh topology (or form a connected graph).

This has the upside that if you lose several nodes in the mesh you probably haven't lost a path to any dynos. If you lose a whole AZ you can spin up new dynos in one of the remaining ones and reconfigure the mesh quickly. My experience with load balancers, especially big ones, is that updating a large swath of VIPs is NOT a fast operation (you would start failing health checks on the missing nodes pretty quickly, but adding new capacity to replace them is hard).

The mesh has the downside that the failure modes are a lot more complicated. Oh, and nobody knows what the hell you're talking about.

Of course, I could be wrong. They could just be using NetScalers (or ELB) and calling it a "routing mesh".
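
The "lose several nodes but still have a path to the dynos" argument can be checked on a toy graph. This is only a sketch under assumptions: the topology, the node names, and the `reachable` and `dynos_served` helpers are all invented here, and nothing in it reflects how Heroku's routing layer is actually wired.

```python
from collections import deque

def reachable(graph, sources, dead=frozenset()):
    """Return every node reachable from the sources, skipping dead nodes."""
    seen = {s for s in sources if s not in dead}
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in dead and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Hypothetical topology: two ingress points, three routers, two dynos.
mesh = {
    "in1": ["r1", "r2"],
    "in2": ["r2", "r3"],
    "r1": ["r2", "dyno_a"],
    "r2": ["r1", "r3", "dyno_a", "dyno_b"],
    "r3": ["r2", "dyno_b"],
}
dynos = {"dyno_a", "dyno_b"}

def dynos_served(dead):
    """Dynos that still have a path from some ingress point."""
    return reachable(mesh, ["in1", "in2"], dead) & dynos

lost_one = dynos_served({"r2"})        # the best-connected router dies
lost_two = dynos_served({"r1", "r2"})  # two routers die together
```

In this toy mesh, losing `r2` alone leaves both dynos reachable, while losing `r1` and `r2` together cuts off `dyno_a`. That is the trade-off in miniature: a single failure is absorbed, but which combinations of failures hurt depends on the topology and is harder to reason about than "the load balancer is down".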

"Oh, and nobody knows what the hell you're talking about."

Yes. It's a level of detail that borders on obfuscation.

fwiw, I've always used the term "load balancer" to also refer to two redundant load balancing machines. (If I worked with more complex load balancers, I doubt I'd stop.) In the general sense, it just means "the apparatus that balances the load".

It's "dyno manifold" that's the issue here. However, a quick Google search took me here:

https://devcenter.heroku.com/articles/dynos

It explains all :-)

Pretentious is an understatement.

This seems to be an unfortunate attempt to apply Corporate Speak to a technical announcement: "Let's see how many paragraphs we can fill with technical-sounding gibberish without actually telling anyone anything..."

If a future update mentions 'phase modulation' we'll know they're just cribbing excuses from old Star Trek episodes.
Nothing a directed tachyon beam can't fix.
It might be pretentious. But it serves as (1) sales-speak and (2) provides a frame of metaphor that may well assist the engineers internally.
How else would you describe them?
How about words like "ec2 instances", "erlang processes", "haproxy", "nginx" and similar stuff that was likely involved in the incident?

If they're too embarrassed to tell what happened then they should just keep quiet. Don't insult your customers with handwavy bullshit bingo, that just leaves a sour taste in everyone's mouth.

Just imagine the hilarity when the PHB asks his in-house engineer to translate this "post-mortem" into layman's terms for him. Most bosses have a bit of humor, but not when it comes to hosting infrastructure.

Amazon also has its own fancy vocabulary but instead of cool sounding words like manifold they prefer short acronyms.

Things like EC2, RDS, AWS, S3, EBS, EMR, IAM, AMI, SQS, SNS, SES, HPC, VPC.

But Amazon is the reference in cloud hosting and these terms are well understood in the field.

Heroku also had to coin some words to describe their architecture. But frankly this outage report is worthy of a Hollywood hacker movie:

"A manual garbage collection process which created an unusual record in the data stream" Wow!

Heroku had to coin some words of their own to "mask" the fact that their services are but engineering on top of the AWS stack (which isn't to belittle the effort involved).
"connects the tachyon emitter to the warp nacelles"?
"The first root cause is related to the streaming data API which connects the dyno manifold to the routing mesh. On the dyno management side, an engineer was performing a manual garbage collection process which created an unusual record in the data stream. On the routing side, a bug in the subprocess of the router which processes the incoming stream saw the record as garbage."

This is techno-babble on a scale the world has never seen!

The only unusual terms in that paragraph are "dyno manifold" and "routing mesh", both of which are Heroku-specific technologies that Heroku users should know of. The rest is just normal systems stuff.
I can understand the words individually, but I've never before seen or heard of a manual garbage collection process creating an unusual record in a data stream. It's like a tech jargon ad lib.
I like the part where some guy/gal with a stop watch was keeping records of every minute.
Their interactive engineer mesh stream handles this time keeping.

(That is, probably an irc channel..)

The problem with Heroku is that you need to be a certain level of tech savvy to make use of their services.

We're expecting them to be the A-Grade tech wizards who can give us 0 down time. They are, after all, expecting thousands of people to trust their services and to outsource the server hosting and administration duties to them.

So they tread the fine line between convenience (and related "cloud" benefits) and "I can do this myself".

If they can't give us the assurances that they can do it better, cheaper and more reliably than we can do it ourselves then what good are they?

If they can't capacity plan a simple System Status page (running on Rackspace) and keep that up and running then what good are they?

And since their service appeals to a certain level of geek competence, they also can't get away with techno-babble bullshit responses to outages.

"We're expecting them to be the A-Grade tech wizards who can give us 0 down time."

Exactly. But something like this which they said makes them seem so ordinary:

"The improved status site allows users to subscribe to notifications when an incident is opened. As a result, our status site experienced unprecedented spikes of load during this incident. This high load crushed the site,"

Basically saying whatever they set up for a status site choked on sending out emails or SMS, as if they were hosted on a shared server and got mentioned simultaneously on a few major sites.

Quite a few Erlang gotchas in those notes. Fault tolerant systems are really hard to design even when you know what you're doing and are using the best language for it (Erlang). Erlang aside, it seems the higher level architecture may need a rethink if one bad record can bring down the whole thing.
It looks like the error recovery code wasn't well tested. Error recovery code in distributed systems is some of the hardest code to test effectively, mind.

The thundering herd of recovery is especially difficult to cope with: your error recovery code can work just fine for normal outages but then fail completely when faced with just a few more components going dark.
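
One common mitigation for a thundering herd (a general technique, not something the post-mortem says Heroku uses) is to have each recovering component retry with exponential backoff plus random jitter, so they don't all come back at the same instant. A rough Python sketch; `backoff_delays` and its parameters are made up for illustration:

```python
import random

def backoff_delays(attempts, base=0.5, cap=30.0, rng=random):
    """Return 'full jitter' retry delays: uniform in [0, min(cap, base * 2**n)]."""
    return [rng.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]

# Simulate 1000 clients that all lost their connection at the same moment.
rng = random.Random(42)  # seeded only so the sketch is repeatable
third_retry = [backoff_delays(3, rng=rng)[2] for _ in range(1000)]

# Count how many clients would retry inside each 0.5 s slice of the 2 s window.
buckets = [0] * 4
for delay in third_retry:
    buckets[min(int(delay / 0.5), 3)] += 1
```

Without jitter, all 1000 clients would land on the same delay and hit the recovering service together; with it, the third retry is spread across the whole 2 second window. It doesn't make recovery code correct, but it does take some pressure off at the moment the system is weakest.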

weird. they had a problem caused by a series of bugs, yet the word "test" doesn't appear anywhere in that page.
Testing distributed systems is much, much harder than doing so on a monolithic codebase. The number of failure modes goes up very rapidly with the number of nodes in the system & your code has to (in principle) cope with every possible one.
true, but it sounds like they (and perhaps you) have never even heard of the chaos monkey.
Randomly killing instances wouldn't have detected this particular failure mode as far as I can see, since the error lay in the inability to resurrect a failed process under certain circumstances.
Surely I'm not the only one infuriated by their choice of blue text on a blue background?
I was all prepared to be angry, but it's actually ok, to be honest. I normally dislike dark themes.
I tend to feel much less angry about outages and such when businesses take the time to explain what happened. Of course, this doesn't justify the lack of uptime, but I like the gesture!
Sounds about one inch away from an AFJ.