by alrs 4400 days ago
Joyent's messaging about "we're cloud, but with perfect uptime" was always broken.

It's mildly gross that the current messaging sounds like they're throwing a sysadmin under the bus. If fat fingers can down a data center, that's an engineering problem.

I care about an object store that never loses data and an API that always has an answer for me, even if it's saying things that I don't want to hear.

99.999 sounds stuck-in-the-90s.

3 comments

> sounds like they're throwing a sysadmin under the bus

at least they didn't name the operator in question...

Our internal culture is such that everyone on the team would rather be blamed for something than accuse someone else of doing it. That's shitty, and not something you do to someone. You fix the problem and then you move on.

If it makes you happy, blame me - I don't mind.

At my $DAYJOB, we are always careful to figure out exactly what happened, including by whom. It's not to assign personal blame, but I believe it's critical that everyone agrees on the facts (who, what, when, where, and [if possible] why).

Response and conversation are always focused on "how do we prevent this in the future?", not on punishing whoever was involved in the past.

IOW, I agree with what I believe is your intent, but differ on the implementation. Blameless transparency is the term we use (and we probably stole that from somewhere else).

It's a very powerful signal to the whole team when you first see individuals "admitting" to exactly what they did, how it caused or contributed to the outage, and to hear them thanked for their contribution of understanding in the post-mortem.

Senior leadership (including myself, who originally instituted the entire process a decade ago) is very clear that we want to know the facts and that in seeking and using those facts, we're only focused on the future, no matter how boneheaded the individual actions appear with the benefit of hindsight and knowledge that they'd lead (in)directly to an outage. I run operations and also participate in the promotion discussions for all technologists, and in 11 years, I've never heard a negative shadow cast onto a sysadmin/sysengineer from their actions during or leading to a production outage. And we've (collectively) made our fair share of mistakes over the years. That doesn't stop good employees from feeling bad about it, but that's a personal feeling they have, not from the fear of it being a professional black mark.

I think there's a difference in how you approach this with an internal-facing view and an external one.

Internally, you're right. But externally the company fucked up, not the individual.

100% agree, and it is my oversight to not draw that distinction more clearly. We have the luxury (so far) of only reporting internally.
BTW, this is the right way to do it. :)
"elijahwright" shall henceforth be used in place of "scapegoat"
Awesome! It's what I've always wanted!!!
it was that way at tech, no reason for it to change now
Now I have to figure out who you are. :-)
Sure, blame the engineers. You give power, people use it badly, and the engineer gets blamed for giving too much power. You don't give enough power, and sysadmins/users bitch and yell: why don't we have enough power, we're not children?

It's always the engineer's fault. :(

Systems engineers, software engineers, architects, whatever. We're all in the same gang.

My point is that the problem in this case is likely the system's design, not one engineer's typing abilities.

This comes down to operational philosophy, in the end. The point you're dancing around is whether the system should permit grave actions that don't make any sense when you're designing the system.

By the rules, every single system on a commercial aircraft has a circuit breaker. Pilots make the "what if X catches on fire?" case, which is actually pretty compelling. However, that also means there are several switches overhead that will ostensibly crash the airplane if pulled. Pilots lobby very strongly for the aircraft not to fight them in any way because they are the only ones with the data, in the moment, now. They have final command over the aircraft in every way.

I use this to point out that as you're designing systems for operations people -- something we're increasingly doing ourselves as devops/SRE takes hold -- you might think you can anticipate every scenario and design suitable safeguards into the system. However, sometimes, when Halley's Comet refracts some moonlight into swamp gas and takes your fleet down, you as an operator have to do some really crazy shit. It's in that moment, when all hell has broken loose, I'm at the helm, and based on the data available to me I have made a decision to shoot the system in the head: if the system fights me and prolongs an outage because we argued about whether we'd ever need to reboot a fleet all at once, I'm replacing the system as the first item in my postmortem. If you make me walk row to row flipping PDUs, we're going to have words.

That's just my philosophy. Give the operators the knives and let them cut themselves, trusting that you've hired smart people and understanding mistakes will happen. Your philosophy may vary. By all means, ask me to confirm. Ask me for a physical key, even. But if you ever prevent me from doing what I know must be done, you are in my way. I have yet to meet a system that is smarter than an operator when the shit hits the fan (especially when the shit hits the fan).

There's probably a broader term for operational philosophy like this.

...and the operations version of that is that all normal operations are performed under restricted permissions that cannot "do anything", while the full "do anything" permissions are only broken out during a major crisis.

Such an approach would have prevented this incident where "normal" operations were being performed and accidentally ALL the servers were rebooted at once.
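
A minimal sketch of that split, assuming a hypothetical `Session` object and a made-up batch limit (`NORMAL_MAX_TARGETS`); none of these names come from Joyent's tooling. Routine sessions can only touch small batches, while an explicitly elevated "break glass" session can do anything, including a fleet-wide reboot:

```python
from dataclasses import dataclass

# Hypothetical limit: the largest batch a routine session may reboot.
NORMAL_MAX_TARGETS = 5


@dataclass
class Session:
    operator: str
    break_glass: bool = False  # elevated "do anything" mode, off by default


def authorize_reboot(session: Session, targets: list[str], fleet_size: int) -> bool:
    """Return True if this session may reboot the given targets."""
    if session.break_glass:
        # Crisis mode: the operator has final say, even over the whole fleet.
        return True
    # Routine mode: small batches only, and never the entire fleet.
    return len(targets) <= NORMAL_MAX_TARGETS and len(targets) < fleet_size
```

The point of the design is that the dangerous capability still exists, so the operator is never blocked in a real emergency, but it has to be asked for deliberately instead of being the default.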

I tend to agree with you, with the caveat that you can't have this philosophy and sell your customers 99.999% uptime[0].

[0] http://www.joyent.com/products/compute-service/features/linu...

I disagree wholeheartedly. Your operational philosophy complements your SLA goals, it doesn't force them.
I can't figure out how your comment that "understanding mistakes will happen" is compatible with 99.999% uptime.

I'm of the opinion that 99.999% for an individual instance isn't particularly achievable in a commodity hosting environment. That kind of uptime doesn't leave much room for the mistakes that you and I both anticipate.

I do think that 99.999% is doable for a properly distributed whole-system across multiple geographically-dispersed datacenters.

I think Joyent has gone wrong in promoting individual instance reliability.
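
To put rough numbers on that, here is a back-of-the-envelope sketch. The function names are made up for illustration, and the multi-site formula assumes site failures are fully independent, which real datacenters only approximate:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_budget_minutes(availability: float) -> float:
    """Minutes of downtime per year allowed at a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def combined_availability(per_site: float, sites: int) -> float:
    """Availability of a service that is up if at least one site is up.

    Assumes independent failures across sites (an idealization).
    """
    return 1.0 - (1.0 - per_site) ** sites
```

Five nines works out to a little over five minutes of downtime a year (`downtime_budget_minutes(0.99999)` is about 5.26), which a single mistaken reboot can easily consume. Under the independence assumption, two sites at 99.9% each give `combined_availability(0.999, 2)` of roughly 0.999999.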

They're not. That's a statement of what customers have enjoyed up until now. The actual SLA simply states what refund you get for each hour of downtime.
It's a combined fault. Clearly the operator made a mistake, but the system shouldn't have let such a calamitous operation take place without at least three levels of "Are you sure" (or something smarter like "Confirm how many servers would you like to reboot:") before it lets you take down thousands of servers.
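
A rough sketch of the "confirm how many servers" idea, with a hypothetical function name and no claim about how Joyent's tooling actually works. Instead of a yes/no prompt that muscle memory will answer, the operator has to type the exact number of machines the command is about to hit:

```python
def confirm_reboot(targets: list[str], typed_count: str) -> bool:
    """Proceed only if the operator typed the exact number of affected servers."""
    try:
        return int(typed_count.strip()) == len(targets)
    except ValueError:
        # Anything that is not a number (e.g. "y" or "yes") is a refusal.
        return False
```

In a CLI this would be fed from something like `input(f"Confirm how many servers you would like to reboot: ")`. An operator who meant to reboot one box and types `1` gets stopped when the target list turns out to hold thousands.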
Joyent's marketing is not the most transparent. They haven't updated AWS prices in their pricing page since AWS lowered their prices two months ago.
What?

Joyent doesn't use AWS.