Netflix site is down (outage.report)
34 points by tomerific 3617 days ago
7 comments

Question:

Given the relatively limited amount of static content they distribute, and what seem like only daily updates, how is there not a switch they can flip when things go south to spin the service up in another region or on another provider?

Seems like it'd be the logical thing to do, given that AWS is always going to have another outage and NFLX has lots of time and smart engineers to plan and prepare for these eventualities.

I think you might underestimate the scale here. It's more like "always on, in all regions, with as many providers as they can". This is a company that has had to innovate in basically every area of its business to keep delivering what it does.
It's also not entirely static: there are a lot of checks in place because of licensing models, and beyond that there are different encodings and quality levels of streams to support various clients.
I understand the site isn't static, but fundamentally what they are serving are static video streams. Encoding for video streams of varying quality levels is entirely pre-computed, so the streams seem like static assets. Anyways, my gripe is that I am not seeing a good reason for not having a working failover plan ready to go at all times for the service driving a publicly traded company. Even scale doesn't seem like a good reason, as I'm sure Google GCE would love to get a few slices of the Netflix pie. So I'm just left perplexed.
The video streams are delivered from their OpenConnect appliances. The video encoding, their actual website and all the client interaction is run in AWS, active/active in three regions (and multiple availability zones per region).

The AWS part is also very dynamic: at any given time most customers are (unknowingly, behind the scenes) participating in 8-10 beta features.

That said, this is all based on talks and presentations they have given at various conferences in the past. It could be different, especially some AWS parts.
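
To make the active/active idea concrete, here is a minimal sketch, not Netflix's actual routing. The region names, the `healthy` map, and `pick_region` are all hypothetical; the point is only that when every region serves traffic all the time, "failover" amounts to dropping the unhealthy region from the pool.

```python
import random

# Hypothetical region names; the talks only mention "three regions".
REGIONS = ["region-a", "region-b", "region-c"]


def pick_region(healthy, rng=random):
    """Pick a region to serve a request from those marked healthy.

    `healthy` maps region name -> bool (an assumed health-check result).
    Regions missing from the map are treated as unhealthy.
    """
    candidates = [r for r in REGIONS if healthy.get(r, False)]
    if not candidates:
        # Nothing left to fail over to: this is the outage case.
        raise RuntimeError("no healthy region available")
    return rng.choice(candidates)
```

With all three regions healthy, requests spread across all of them; with only one healthy, everything lands there, which is why each region needs headroom for the others' load.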

Obviously they do have a failover plan, but no plan is infallible -- especially when it involves a complex distributed software system plus human decision-making.

You never notice all the times when the failover is executed smoothly with no interruption in service, just the times when something goes wrong.

And I promise that there are failovers, simulations, testing, smaller issues, moving loads around, etc. happening all the time behind the scenes. Getting caught out is no fun, but it's a very low percentage compared with all the times when changing the tires on the bus while driving down the freeway goes [mostly] without a hitch.
Yes, you are right. In fact, I was wondering the same thing. They also make sure their systems are resilient by testing scenarios ranging from one instance going down [1] to a whole data center going down [2], and yet this happens. I guess we have to wait until the post-mortem report comes in on this.

[1]- http://techblog.netflix.com/2012/07/chaos-monkey-released-in... [2]- http://techblog.netflix.com/2011/07/netflix-simian-army.html
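
For anyone unfamiliar with the idea behind [1], a toy sketch of that style of test, not the real Chaos Monkey code: kill one instance at random and check the fleet still has enough capacity. The instance IDs, `kill_random_instance`, and `survives` are made up for illustration.

```python
import random


def kill_random_instance(instances, rng=random):
    """Return (victim, remaining) after removing one random instance."""
    # Sort first so the choice is reproducible for a seeded rng.
    victim = rng.choice(sorted(instances))
    return victim, instances - {victim}


def survives(instances, min_required):
    """True if enough instances remain to keep serving traffic."""
    return len(instances) >= min_required


# Hypothetical three-instance fleet that needs at least two to serve.
fleet = {"i-1", "i-2", "i-3"}
victim, remaining = kill_random_instance(fleet, random.Random(0))
```

If `survives(remaining, 2)` ever comes back false in a setup like this, you have found a capacity problem on your own schedule instead of during an outage.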

They do this all the time. They switch back and forth between regions very frequently to test exactly this scenario.
It's still down too... on a Saturday night they must be absolutely hounded with complaints.
Working fine on my end.
Meta: why does https://outage.report redirect to http://outage.report?
I have heard that some sites reported a 30% decline in ad revenue as soon as they started using HTTPS. I don't know the reason for this, though.
The reason is probably that not all ad networks support HTTPS, so you can't make the same money on HTTPS-only ads (fewer networks will bid on them). An HTTP ad on an HTTPS page would most likely be blocked by the browser as mixed content and never seen, so sticking with HTTP makes sense that way.
Does Netflix have a standard status dashboard like other services do?
Do any consumer sites have that? Seems like more of an enterprise-SLA type thing.
Interesting, I cannot log in on their website, but I can access it on my phone.
I'm having fun watching the outrage on the Twitter feed.
It appears as if it just came back up for my region. Literally in the last five minutes.

Obviously people's mileage may vary, since it could be region-dependent.