In December 2015 I received an e-mail from AWS, around 4 am, with the following subject line:
"Amazon EC2 Instance scheduled for retirement"
When I checked the logs it was clear the hardware failed 30 mins before they scheduled it for retirement. EC2 and root device data was gone. The e-mail also said "you may have already lost data".
So I know that Amazon schedules servers for retirement after they have already failed; the green check doesn't surprise me.
So just as an FYI, the reason that probably happened to you is that the underlying host was failing. I am assuming they wanted to give you a window to deal with it but the host croaked before then. I've been dealing w/ AWS for a long long time and I've never seen a maintenance event go early unless the physical hardware actually died...
Not saying my scale is the same at all - but the fact they can't do something so simple that I can do it as a single individual is embarrassing at best.
Simple solutions to this do scale - Linode and DigitalOcean don't have such issues for example - and while they're not Amazon scale, they are quite large and I'd say they prove the concept.
EBS data is backed up in multiple redundant ways (using erasure coding, I think).
Local storage is not intended for permanent storage, and is more use at your own risk. That's also why most of the new EC2 instances don't even support local storage.
It's not just a RAID that can fail. And everyone who uses AWS should expect failures. You should build your infrastructure to handle such failures well.
I feel for them. Imagine, 40 or 50 different engineering teams all responsible for updating their statuses. At this moment on the AWS status page I see random usage of the red, yellow, and green icons, even though all the status updates are "Increased error rates." What that tells me is that there's no unified communication protocol across the teams, or they're not following it. And just imagine what it's like being on the S3 team right now.
I notice even Cloudflare is starting to have problems serving up pages now.
I guess their bizarre thinking is something along the lines of: "unless we have proof that no one can access the service, we won't change the indicator from green to yellow."
Seriously: I don't understand why you guys stay with AWS.
Because you perceive public clouds only as virtual machine providers that you can replace with another provider in two days. A detailed cloud migration consists of replacing some parts of your software to use managed services provided by a specific cloud provider, and AWS still has the best service offerings IMHO. When you use these services carefully, you will also see that AWS is very cheap and reliable enough. Outages like today's happen on every platform and it is possible to mitigate them.
You can use AdWords as a self-service user. Without knowing many of the details you can run your ads, but you can also very easily ruin your budget. But many enterprise customers use it very differently from those users and optimize their costs heavily. Cloud is the same. If you don't know how big customers use AWS, it is normal that you are surprised that AWS is still leading the market.
You say GCP is better than AWS. Which part is better? GCP lacks many of the AWS services we benefit from. How can you compare totally different providers? You can only say AWS EC2 is worse than GCP. But you cannot compare whole platforms in one sentence.
(Sorry, I'm late to reply, but since you addended your comment you might still be listening...)
After spending a year evaluating both AWS and GCP (with an emphasis on their managed database services; both SQL and no-SQL) my general feeling is this:
"Microsoft Windows is to Unix as AWS is to GCP".
(Or perhaps closer to the truth: "VMS is to Unix as AWS is to GCP".)
Basically, AWS services seem like they were badly designed by bureaucratic, mediocre engineers following some bureaucratic template for "a service".
GCP feels a lot saner (both API- and UI/console-wise). I often got the feeling it's designed by people who:
a) are smart and well-rounded in terms of experiences. It does take cleverness and experience to design something elegant that is also useful.
You talk about SQL and No-SQL as managed services, and it shows that your experience is limited to a classical application consisting of virtual machines and some data storage. However, these are not the only services offered by both platforms, and currently AWS has a richer feature set. For example, Lambda and its deep integration with the whole AWS platform is the biggest game changer from my point of view. If we are talking about virtual machines and databases, I can accept this comparison. However, we are talking about 30+ services, some of which are not even available anywhere else and are solving serious business problems in production and at scale. It is very wrong to put everything into one basket and compare. Maybe GCP has a better pub/sub service and AWS has better object storage. These should be compared separately. To answer your question of why we still stay at AWS: because it is solving our problems in the most cost-effective way and with reduced complexity, we are happy with it.
I specifically spent a lot of time on Lambda and found it quite annoying compared to GCP AppEngine. So much bureaucracy. Just this thing that you have to specifically register every single Lambda API call and its parameters using an interface built by non-thinking people.. Sheesh.
For on-demand processing I just want a single HTTP-ish entry point, like AppEngine provides. (That way I can move my service between different providers, if I wanted to move away from e.g. AWS.)
> Seriously: I don't understand why you guys stay with AWS.
Personally I've been using it for ages and I know most services inside and out. They do suffer downtime in some regions occasionally, but it'd be too expensive at this point to move.
And who doesn't suffer downtime? You can't avoid it; you just need a plan to deal with it. For example, having a backup replica bucket in another region and the ability to quickly switch your CDN over would probably be a good idea here; that's what I did.
If you want to go further you can replicate your data to another cloud provider entirely and use low TTLs to switch to a backup CDN if your system is that mission-critical (in the event of a worldwide AWS failure doomsday scenario).
All systems will fail you and it's our responsibility as IT professionals to have a plan to mitigate this.
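The replica-bucket fallback described above could look roughly like the following. This is only a minimal sketch: `fetch_with_failover`, `primary_region` and `replica_region` are hypothetical names, and the two fetchers are stand-ins that simulate an outage rather than real storage clients.

```python
from typing import Callable, Optional, Sequence

# A "fetcher" is any callable that takes an object key and returns its bytes,
# raising an exception on failure. In practice each one would wrap a client
# for one bucket/region; here they are hypothetical stand-ins.
Fetcher = Callable[[str], bytes]


def fetch_with_failover(key: str, fetchers: Sequence[Fetcher]) -> bytes:
    """Try each fetcher in order and return the first successful result."""
    last_error: Optional[Exception] = None
    for fetch in fetchers:
        try:
            return fetch(key)
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
    raise RuntimeError(f"all sources failed for {key!r}") from last_error


def primary_region(key: str) -> bytes:
    # Simulates the primary bucket during an outage.
    raise ConnectionError("503 Service Unavailable")


def replica_region(key: str) -> bytes:
    # Simulates a healthy replica bucket in another region.
    return b"contents of " + key.encode()


if __name__ == "__main__":
    print(fetch_with_failover("logo.png", [primary_region, replica_region]))
```

The order of the list is the failover order, so adding a second provider as a last resort is just one more entry. It only helps if the replica is actually kept in sync, which this sketch does not cover.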
However, I also think it's a sign of failure in planning and architecture foresight if it's too expensive to move away from a particular cloud provider.
The sunk cost fallacy is when you (irrationally) decide to stick with what you're doing purely because you've already spent a lot of resources on it. It doesn't apply when you've done an economic analysis and found out it doesn't make sense to swap.
There are plenty of cases where it just wouldn't make sense to switch after looking at the costs, opportunity costs, etc. For example, if his site makes him $10 a month, outages cost him $1 a month that could be mitigated by moving, and it would cost $1000 of labor to swap providers. (Depends on interest rates.)
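To put rough numbers on that example, here is a minimal sketch. It assumes the $1 a month saving would continue forever and that dividing an annual rate by 12 is a good enough monthly discount rate; both are simplifying assumptions, and the function names are made up for illustration.

```python
def payback_months(migration_cost: float, monthly_saving: float) -> float:
    """Months until the saving repays the migration cost, ignoring interest."""
    return migration_cost / monthly_saving


def present_value_of_savings(monthly_saving: float, annual_rate: float) -> float:
    """Value today of receiving monthly_saving every month forever.

    Uses the perpetuity formula PV = payment / rate, with the monthly rate
    approximated as annual_rate / 12.
    """
    monthly_rate = annual_rate / 12
    return monthly_saving / monthly_rate


if __name__ == "__main__":
    cost, saving = 1000.0, 1.0
    print(payback_months(cost, saving))            # 1000 months, roughly 83 years
    print(present_value_of_savings(saving, 0.05))  # about 240 dollars at 5% a year
```

Under these assumptions the move only pays off if the discount rate is below about 1.2% a year (where the present value of the savings reaches $1000), which is why the interest rate matters.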
Perhaps it was originally a failure to not have a plan to easily move from a provider, but it doesn't seem unreasonable to me that right now it may cost too many hours of work to justify the move.
It's not as though it would be impossible; our integration with AWS isn't that deep, it's not as though we use DynamoDB for our core data store or anything like that. But even migrating from one traditional datacenter to another isn't easy from an operational point of view.
There needs to be a clear financial win. Even taking into account the failures we've seen so far, I don't see a compelling reason to leave AWS.
Google Cloud if you're looking for something similar. It's just so much better and cheaper. I think a lot of the resistance here towards that kind of move is just because people are inherently lazy and they aren't paying the bill themselves.
(I'm guessing a relatively large part is also selfish attachment to the market leader because of employment reasons. I hate wasting money, both for myself and for my employer, so I don't really understand this kind of thinking - but I do understand how it could flourish in a venture capital-rich time/locale.)
As already mentioned, they do have Windows VMs but there are some caveats that indicate it's not fully baked yet. 1.) They require that each VM MUST have a public IP address so that Windows can talk to an activation server every 30 days. 2.) You cannot yet bring your own license.
S3 in a single region is based out of multiple data centres / availability zones, with data distributed so that the loss of a single availability zone won't impact either data availability or durability, even to the point of being comfortable with complete physical destruction of an AZ. The same applies for Azure, GCP etc.
B2 is based out of a single DC (or at least, was at launch and I don't see anything that suggests that has changed?) You've got to decide what's most important to you. Data persistence or $$$.
OVH doesn't even want to take my money to keep my server running. Their auto-billing process is busted and when it goes wrong they just delete your server.
I think it's more, "if the service can't do what people need it to do, that's a problem; if the service cluster gets wedged hard enough to stop responding to the requests of our monitoring system, that's a failure."
Which would make sense (and is sorta-kinda a best-practice) if Amazon wrote services such that they "crashed early", but instead they're seemingly written so the backend can lock up and be rendered completely useless at "doing its job" while the service continues to run just fine.
Either of those two design decisions is potentially a good thing on its own, but they need to be considered in light of one-another if you want your status page to make any sense. If you want to report cluster failures, code your clusters to actually fail. If you want to keep your clusters up, write your monitoring checks as whole-stack acceptance tests.
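A monitoring check written as a whole-stack acceptance test might be sketched like this: do a real write/read/delete round trip and report green only if all of it works. The `ObjectStore` interface, the two toy stores and the green/red strings are all assumptions for illustration, not anything AWS actually exposes.

```python
import uuid
from typing import Dict, Protocol


class ObjectStore(Protocol):
    """Hypothetical minimal interface for the service being monitored."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...
    def delete(self, key: str) -> None: ...


def acceptance_check(store: ObjectStore) -> str:
    """Return 'green' only if a full write/read/delete round trip succeeds."""
    key = f"healthcheck/{uuid.uuid4()}"
    payload = b"ping"
    try:
        store.put(key, payload)
        ok = store.get(key) == payload
        store.delete(key)
    except Exception:
        # Any failure in the round trip counts as the service being down,
        # even if the processes behind it are still running.
        return "red"
    return "green" if ok else "red"


class InMemoryStore:
    """Toy healthy store."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._data[key] = data

    def get(self, key: str) -> bytes:
        return self._data[key]

    def delete(self, key: str) -> None:
        del self._data[key]


class WedgedStore:
    """Toy store that is 'up' but fails every request."""

    def put(self, key: str, data: bytes) -> None:
        raise TimeoutError("request timed out")

    def get(self, key: str) -> bytes:
        raise TimeoutError("request timed out")

    def delete(self, key: str) -> None:
        raise TimeoutError("request timed out")


if __name__ == "__main__":
    print(acceptance_check(InMemoryStore()))
    print(acceptance_check(WedgedStore()))
```

The point of the design is that the check exercises the same path a customer would, so a backend that is running but wedged still shows up as red.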
The worst thing is when your system can't handle these "increased error rates" as your control plane cascades failure due to something like this....
The worst "increased error rate" problem I had was when the API was failing and my autoscale system couldn't deal and launched thousands of instances because it couldn't tell when instances were launched (lack of API access) and the instances pummelled the fuck out of all other parts of the system and we basically had to reboot the entire platform....
Luckily, Amazon is REALLY forgiving with respect to costs in these (and actually most) circumstances....
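One way to guard against that kind of runaway launch is to refuse to scale up when the fleet size is unknown and to cap launches per cycle. This is a minimal sketch of the idea, not how any particular autoscaler works; `instances_to_launch` and its parameters are hypothetical.

```python
from typing import Optional


def instances_to_launch(
    desired: int,
    observed: Optional[int],
    pending: int,
    max_per_cycle: int = 10,
) -> int:
    """Decide how many instances to start in this scaling cycle.

    observed is the number of running instances reported by the provider API,
    or None when that API call failed. pending counts launches we already
    requested but have not yet seen come up.
    """
    if observed is None:
        # Fleet size is unknown: launching blind is what causes runaway growth.
        return 0
    shortfall = desired - observed - pending
    # Never launch a negative number, and never more than the per-cycle cap.
    return max(0, min(shortfall, max_per_cycle))


if __name__ == "__main__":
    print(instances_to_launch(desired=20, observed=None, pending=0))  # 0
    print(instances_to_launch(desired=20, observed=5, pending=10))    # 5
    print(instances_to_launch(desired=100, observed=5, pending=0))    # 10
```

Tracking `pending` locally matters as much as the cap: it keeps the loop from re-requesting instances it has already asked for but cannot yet see.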
I've heard (on the Fnord news show at the most recent CCC congress, so take it with a grain of salt and a bucket of humor) that Amazon's TOS are more or less void when a Zombie Apocalypse breaks out.
They had some convoluted but fairly specific wording in their TOS; whoever wrote it must have had a lot of fun.
> 57.10 Acceptable Use; Safety-Critical Systems. Your use of the Lumberyard Materials must comply with the AWS Acceptable Use Policy. The Lumberyard Materials are not intended for use with life-critical or safety-critical systems, such as use in operation of medical equipment, automated transportation systems, autonomous vehicles, aircraft or air traffic control, nuclear facilities, manned spacecraft, or military use in connection with live combat. However, this restriction will not apply in the event of the occurrence (certified by the United States Centers for Disease Control or successor body) of a widespread viral infection transmitted via bites or contact with bodily fluids that causes human corpses to reanimate and seek to consume living human flesh, blood, brain or nerve tissue and is likely to result in the fall of organized civilization.
I just check Twitter, since Amazon's status is always a lie. My personal dashboard is still showing no problems. It's bad enough that the main public status is always green even when there's clearly a problem, but you'd think they could at least make the private status accurate.
Pretty confident that isn't it. S3 was returning InternalErrors for 22 seconds before it started timing out and/or returning 503s to all my requests.
I'd bet that something broke (causing InternalError responses) and then nodes started marking themselves as failed (causing the timeouts and 503s soon after).
Looks like they have fixed the issue with their health dashboard now.
From https://status.aws.amazon.com/ : Update at 11:35 AM PST: We have now repaired the ability to update the service health dashboard. The service updates are below. We continue to experience high error rates with S3 in US-EAST-1, which is impacting various AWS services. We are working hard at repairing S3, believe we understand root cause, and are working on implementing what we believe will remediate the issue.
There was an alert on the personal health dashboard[1] a second ago. It said "S3 Operational issue in us-east-1", but when I tried to view the details it showed an error.
Then I refreshed and the event disappeared altogether.
Just sent out a notice to our customers via our status page. I really wanted to be able to add a link back to AWS detailing the issue but that's a pipe dream I suppose.
We have a Slack emoji for it called greenish. It's the classic AWS green checkmark with an info icon in the bottom. Apparently it's NOT an outage if you don't acknowledge it. It's called alt-uptime.