Given the relatively limited amount of static content they distribute, and what seem like only daily updates, how is there not a switch they can flip when things go south to spin the service up in another region or on another provider?
Seems like it'd be the logical thing to do, given AWS is always going to have another outage and NFLX has lots of time and smart engineers to plan and prepare for these eventualities.
I think you might underestimate the scale here. It's more like "always on, in all regions, with as many providers as they can". This is a company that's had to innovate in basically every business space it can to keep delivering what it has been.
It's also not entirely static. There are a lot of checks in place because of licensing models, and beyond that there are different encodings and quality levels of streams to support various clients.
I understand the site isn't static, but fundamentally what they are serving are static video streams. Encoding for video streams of varying quality levels is entirely pre-computed, and thus the streams seem like static assets. Anyway, my gripe is that I am not seeing a good reason for not having a working failover plan ready to go at all times for the service driving a publicly traded company. Even scale doesn't seem like a good reason, as I'm sure Google GCE would love to get a few slices of the Netflix pie. So I'm just left perplexed.
The video streams are delivered from their OpenConnect appliances. The video encoding, their actual website and all the client interaction is run in AWS, active/active in three regions (and multiple availability zones per region).
The AWS part is also very dynamic: at any given time most customers are (unknowingly, behind the scenes) participating in 8-10 beta features.
That said, this is all based on talks and presentations they have given at various conferences in the past. It could be different, especially some AWS parts.
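To make the active/active idea concrete, here is a rough sketch of how traffic could be spread over several regions that all serve requests, with unhealthy ones simply dropped from the pool. This is only an illustration under stated assumptions: the region names, the `healthy` map and the `pick_region` function are all hypothetical, and nothing here is taken from Netflix's actual routing.

```python
import hashlib
from typing import Dict

# Hypothetical region names, purely illustrative.
REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]


def pick_region(user_id: str, healthy: Dict[str, bool]) -> str:
    """Pick a serving region for a user among the regions reported healthy.

    Every region takes traffic in normal operation (active/active), so a
    region failure only shrinks the candidate list instead of requiring a
    cold standby to be spun up.
    """
    candidates = [r for r in REGIONS if healthy.get(r, False)]
    if not candidates:
        raise RuntimeError("no healthy region available")
    # Stable hash: the same user maps to the same region as long as the
    # set of healthy regions does not change.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return candidates[int.from_bytes(digest[:8], "big") % len(candidates)]


all_up = {r: True for r in REGIONS}
one_down = dict(all_up, **{"us-east-1": False})
print(pick_region("user-42", all_up))
print(pick_region("user-42", one_down))  # never us-east-1
```

The hard part in practice is not this selection step but everything around it: detecting that a region is actually unhealthy, and making sure the remaining regions have the capacity and data to absorb the shifted load.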
Obviously they do have a failover plan, but no plan is infallible -- especially when it involves a complex distributed software system plus human decision-making.
You never notice all the times when the failover is executed smoothly with no interruption in service, just the times when something goes wrong.
And I promise that there are failovers, simulations, testing, smaller issues, moving loads around, etc. happening all the time behind the scenes. Getting caught out is no fun, but it's a very low percentage of the times; usually, changing the tires on the bus while driving down the freeway just goes [mostly] without a hitch.
Yes, you are right. In fact, I was wondering the same thing. They also make sure their systems are resilient by testing scenarios ranging from something as simple as one instance going down [1] to a whole data center going down [2], and yet this happens. I guess we have to wait until the post-mortem report comes in on this.
The reason is probably that not all ad networks support HTTPS, and you can't make the same money on HTTPS-only ads (since fewer networks will bid on them). Putting up an HTTP ad would now guarantee that it is not seen, so switching back to HTTP makes sense that way.