by alexleclair 3400 days ago
Yup, same here. It has been a few minutes already. Wanna bet the green checkmark[1] will stay green until the incident is resolved?

[1] https://status.aws.amazon.com/


The red check mark is hosted on S3...
"Care to share the code as an anti pattern?" brilliant.
Comment of the year.
Fact of the year.
In December 2015 I received an e-mail from AWS, around 4 am, with the following subject line:

"Amazon EC2 Instance scheduled for retirement"

When I checked the logs it was clear the hardware failed 30 mins before they scheduled it for retirement. EC2 and root device data was gone. The e-mail also said "you may have already lost data".

So I know that Amazon schedules servers for retirement after they have already failed; the green check doesn't surprise me.

Just as an FYI, the reason that probably happened to you is that the underlying host was failing. I am assuming they wanted to give you a window to deal with it but the host croaked before then. I've been dealing w/ AWS for a long long time and I've never seen a maintenance event go early unless the physical hardware actually died...
That's what happens when a cloud provider doesn't support live migration for VMs.
That's completely ridiculous, get some fucking RAID Amazon.

I order drives off newegg directly to my DC and I'm yet to lose data with the cheapest drives available in RAID10.

Yes, solving problems at your scale and AWS' are quite comparable.
But I never lost data off a USB stick, how hard could it be!
Really?!?! Several times USB sticks (and USB HDs) failed on me and other people I work with.
Not saying my scale is the same at all - but the fact they can't do something so simple that I can do it as a single individual is embarrassing at best.

Simple solutions to this do scale - Linode and DigitalOcean don't have such issues for example - and while they're not Amazon scale, they are quite large and I'd say they prove the concept.

EBS data is backed up in multiple redundant ways (using erasure coding I think).

Local storage is not intended for permanent storage; it is more "use at your own risk". That's also why most of the new EC2 instances don't even support local storage.

Availability =/= durability of course

EBS is incredibly expensive and slow, not really a good solution. It'd be nice if they offered a better local storage option.
It's not just a RAID that can fail. And everyone who uses AWS should expect failures. You should build your infrastructure to handle such failures well.
They offer no RAID on local storage and only the expensive, IO restricted EBS as an alternative.
Yes, the only way a server can die is from non-raided disks.
Otherwise they should at least be providing customers their data back.
I think you misunderstood the local storage. It is not intended to permanently store data. It's a volatile storage like RAM.
It's crazy how much better the communication (including updates and status pages) is of the companies that rely on AWS than AWS' communication itself.

https://status.heroku.com/incidents/1059

Blake Gentry gave a full accounting of Heroku's response process here - http://www.heavybit.com/library/video/every-minute-counts-co...

Amazon should take notes.

I feel for them. Imagine, 40 or 50 different engineering teams all responsible for updating their statuses. At this moment on the AWS status page I see random usage of the red, yellow, and green icons, even though all the status updates are "Increased error rates." What that tells me is that there's no unified communication protocol across the teams, or they're not following it. And just imagine what it's like being on the S3 team right now.

I notice even Cloudflare is starting to have problems serving up pages now.

Font Awesome went out for me for a bit, but they did a great job getting back up and keeping their users in the loop.

https://status.fortawesome.com/

These service health boards are more like an advertisement page than the actual status of the service.
I guess their bizarre thinking is something along the lines of: "unless we have proof that no one can access the service, we won't change the indicator from green to yellow."

Seriously: I don't understand why you guys stay with AWS.

Because you perceive public clouds only as virtual machine providers that you can replace with another provider in two days. A detailed cloud migration consists of replacing some parts of your software with managed services provided by a specific cloud provider, and AWS still has the best service offerings IMHO. When you use these services carefully, you will also see that AWS is very cheap and reliable enough. Outages like today's happen on every platform, and it is possible to mitigate them.

You can use Adwords as a self-service user. Without knowing many of the details you can run your ads, but you can also very easily ruin your budget. Many enterprise customers use it very differently from those users, though, and they optimize their costs aggressively. Cloud is the same. If you don't know how big customers use AWS, it is normal that you are surprised, because AWS is still leading the market.

You say GCP is better than AWS. Which part is better? GCP does not have many services of AWS we benefit from. How can you compare totally different providers? You can only say AWS EC2 is worse than GCP. But you cannot compare whole platforms in one sentence.

(Sorry, I'm late to reply, but since you addended your comment you might still be listening...)

After spending a year evaluating both AWS and GCP (with an emphasis on their managed database services; both SQL and no-SQL) my general feeling is this:

"Microsoft Windows is to Unix as AWS is to GCP".

(Or perhaps closer to the truth: "VMS is to Unix as AWS is to GCP".)

Basically AWS services seem like they are badly designed by bureaucratic, mediocre engineers following some bureaucratic template for "a service".

GCP feels a lot saner (both API- and UI/console-wise). I often got the feeling it's designed by people who:

a) are smart and well-rounded in terms of experiences. It does take cleverness and experience to design something elegant that is also useful.

b) take pride in their work (it does show)

(And then, as a bonus: It's cheaper!)

You talk about SQL and No-SQL as managed services, and it shows that your experience is limited to a classical application consisting of virtual machines and some data storage. However, these are not the only services offered by both platforms, and currently AWS has a richer feature set. For example, Lambda and its deep integration with the whole AWS platform is the biggest game changer from my point of view. If we are talking about virtual machines and databases, I can accept this comparison. However, we are talking about 30+ services, some of which are not even available elsewhere, solving serious business problems in production and at scale. It is very wrong to put everything into one basket and compare. Maybe GCP has a better pub/sub service and AWS has better object storage. These should be compared separately. To answer your question of why we still stay with AWS: it solves our problems in the most cost-effective way and with reduced complexity, and we are happy with it.
You're probably assuming too much again :)

I specifically spent a lot of time on Lambda and found it quite annoying compared to GCP AppEngine. So much bureaucracy. Just this thing that you have to specifically register every single Lambda API call and its parameters using an interface built by non-thinking people.. Sheesh.

For on-demand processing I just want a single HTTP-ish entry point, like AppEngine provides. (That way I can move my service between different providers, if I wanted to move away from e.g. AWS.)

Sorry for endless number of typos and mistakes. Obviously I was sleepy while I was writing this.
> Seriously: I don't understand why you guys stay with AWS.

Personally I've been using it for ages and I know most services inside and out. They do suffer downtime in some regions occasionally, but it'd be too expensive at this point to move.

And who doesn't suffer downtime? You can't avoid it; you just need a plan to deal with it. For example, having a backup replica bucket in another region and the ability to quickly switch your CDN over would probably be a good idea here; that's what I did.

If you want to go further you can replicate your data to another cloud provider entirely and use low TTLs to switch to a backup CDN if your system is that mission-critical (in the event of a worldwide AWS failure doomsday scenario).

All systems will fail you and it's our responsibility as IT professionals to have a plan to mitigate this.
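
To make the "backup replica bucket in another region" idea concrete, here is a minimal sketch of client-side failover. It is not tied to any real SDK: `primary_us_east_1`, `replica_eu_west_1`, and the `REPLICA` dict are hypothetical stand-ins for a primary bucket and its replica, and a real client would catch narrower error types than `Exception`.

```python
from typing import Callable, Sequence


class AllOriginsFailed(Exception):
    """Raised when every configured origin failed to return the object."""


def fetch_with_failover(key: str, origins: Sequence[Callable[[str], bytes]]) -> bytes:
    """Try each origin in order and return the first successful response."""
    errors = []
    for fetch in origins:
        try:
            return fetch(key)
        except Exception as exc:  # assumption: any exception means "try the next origin"
            errors.append(exc)
    raise AllOriginsFailed(f"{len(errors)} origin(s) failed for {key!r}: {errors}")


# Hypothetical primary that is currently failing, as during a regional outage.
def primary_us_east_1(key: str) -> bytes:
    raise ConnectionError("503 Service Unavailable")


# Hypothetical replica in another region, modelled as a plain dict.
REPLICA = {"logo.png": b"replica-bytes"}


def replica_eu_west_1(key: str) -> bytes:
    return REPLICA[key]


print(fetch_with_failover("logo.png", [primary_us_east_1, replica_eu_west_1]))
```

The ordering of `origins` is the whole policy here: the replica is only touched when the primary raises, which keeps cross-region traffic low in normal operation.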

A low TTL on DNS entries might do more harm than good: if your DNS provider gets seriously DDoSed, being able to rely on caches can save the day.

Anyway, I agree with your conclusion.

Sunk cost fallacy.

I do agree that we should all plan for failures.

However, I also think it's a sign of failure in planning and architecture foresight if it's too expensive to move away from a particular cloud provider.

The sunk cost fallacy is when you (irrationally) decide to stick with what you're doing purely because you've already spent a lot of resources on it. It doesn't apply when you've done an economic analysis and found out it doesn't make sense to swap.

There are plenty of cases where it just wouldn't make sense to switch after looking at the costs, opportunity costs, etc. For example, suppose his site makes him $10 a month, outages that could be mitigated by moving cost him $1 a month, and it would cost $1000 of labor to swap providers. (Depends on interest rates.)

Perhaps it was originally a failure to not have a plan to easily move from a provider, but it doesn't seem unreasonable to me that right now it may cost too many hours of work to justify the move.
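
The arithmetic behind that example can be sketched as follows. The $1000 switch cost and $1 a month saving come from the comment; the 0.5% monthly discount rate is purely an assumption for illustration, and the perpetuity formula (saving divided by rate) is one simple way to account for "depends on interest rates".

```python
def payback_months(switch_cost: float, monthly_saving: float) -> float:
    """Months until cumulative savings cover the one-time cost, ignoring interest."""
    return switch_cost / monthly_saving


def present_value_of_savings(monthly_saving: float, monthly_rate: float) -> float:
    """Present value of saving `monthly_saving` every month forever (a perpetuity)."""
    return monthly_saving / monthly_rate


switch_cost = 1000.0   # one-time labor cost, from the comment
monthly_saving = 1.0   # outage losses avoided per month, from the comment
monthly_rate = 0.005   # assumed discount rate of 0.5% per month (not from the thread)

months = payback_months(switch_cost, monthly_saving)
value = present_value_of_savings(monthly_saving, monthly_rate)

print(f"payback: {months:.0f} months (about {months / 12:.0f} years)")
print(f"present value of all future savings: ${value:.0f}")
print("worth switching:", value > switch_cost)
```

Under these assumptions the move never pays for itself: the savings are worth far less today than the labor, and the break-even discount rate would have to be below 0.1% per month.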

It's not as though it would be impossible; our integration with AWS isn't that deep, it's not as though we use DynamoDB for our core data store or anything like that. But even migrating from one traditional datacenter to another isn't easy from an operational point of view.

There needs to be a clear financial win. Even taking into account the failures we've seen so far, I don't see a compelling reason to leave AWS.

(You're right, I used that term incorrectly.)

Still stand behind the other two points I made in that post though.

> I don't understand why you guys stay with AWS.

Who do you recommend instead (assuming in-house or Hetzner-equiv is out of reach)? Google Cloud? Azure? Rackspace?

Google Cloud if you're looking for something similar. It's just so much better and cheaper. I think a lot of the resistance here towards that kind of move is just because people are inherently lazy and they aren't paying the bill themselves.

(I'm guessing a relatively large part is also selfish attachment to the market leader because of employment reasons. I hate wasting money, both for myself and for my employer, so I don't really understand this kind of thinking - but I do understand how it could flourish in a venture capital-rich time/locale.)

I also recommend reading:

https://thehftguy.com/2016/06/15/gce-vs-aws-in-2016-why-you-...

Google Cloud doesn't exactly have the greatest reliability/uptime either.
https://status.cloud.google.com/summary tells a different story or do you have other information?

I have used GCP for some time without being affected by any incident.

Google also doesn't have the best record for developer tools.
GC's CDN doesn't cache files bigger than 4 MB. No Windows VMs. Bound to AWS for these 2 reasons.
As already mentioned, they do have Windows VMs, but there are some caveats that indicate it's not fully baked yet. 1.) They require that each VM MUST have a public IP address so that Windows can talk to an activation server every 30 days. 2.) You cannot yet bring your own license.
Someone else already mentioned Windows VMs.

Looks like CDN has a 10MB limit:

https://cloud.google.com/cdn/docs/caching

(work at Google Cloud)

What about something like B2 from https://www.backblaze.com/ ?
S3 in a single region is based out of multiple data centres / availability zones, with data distributed so that the loss of a single availability zone won't impact either data availability or durability, even to the point of being comfortable with complete physical destruction of an AZ. The same applies for Azure, GCP etc.

B2 is based out of a single DC (or at least, was at launch and I don't see anything that suggests that has changed?) You've got to decide what's most important to you. Data persistence or $$$.

OVH
Bad idea there, support is horrible.
OVH doesn't even want to take my money to keep my server running. Their auto-billing process is busted and when it goes wrong they just delete your server.
What is your last datapoint on that?

The last year or two has seen a remarkable improvement according to those customers of mine that host there.

I think it's more, "if the service can't do what people need it to do, that's a problem; if the service cluster gets wedged hard enough to stop responding to the requests of our monitoring system, that's a failure."

Which would make sense (and is sorta-kinda a best-practice) if Amazon wrote services such that they "crashed early"—but instead they're seemingly written so that the backend can lock up and be rendered completely useless at "doing its job" while continuing to run just fine.

Either of those two design decisions is potentially a good thing on its own, but they need to be considered in light of one-another if you want your status page to make any sense. If you want to report cluster failures, code your clusters to actually fail. If you want to keep your clusters up, write your monitoring checks as whole-stack acceptance tests.
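
A rough sketch of the difference between the two kinds of check, assuming a hypothetical `InMemoryStore` that stands in for the real service. The `wedged` flag simulates a backend that keeps answering pings while doing no useful work; none of this reflects how AWS actually monitors S3.

```python
import uuid


class InMemoryStore:
    """Hypothetical stand-in for an object store client."""

    def __init__(self, wedged: bool = False):
        self._data = {}
        self._wedged = wedged  # simulates a backend that is "up" but does no work

    def ping(self) -> bool:
        return True  # the process answers even when it is wedged

    def put(self, key: str, value: bytes) -> None:
        if not self._wedged:
            self._data[key] = value

    def get(self, key: str):
        return self._data.get(key)

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


def acceptance_check(store) -> bool:
    """Healthy only if a full write/read/delete round trip succeeds."""
    key = f"healthcheck/{uuid.uuid4()}"
    payload = b"canary"
    try:
        store.put(key, payload)
        ok = store.get(key) == payload
        store.delete(key)
        return ok
    except Exception:
        return False


wedged = InMemoryStore(wedged=True)
print(wedged.ping())             # True: a liveness probe sees nothing wrong
print(acceptance_check(wedged))  # False: the round trip exposes the failure
```

A status page driven by `ping` stays green through exactly the failure mode described above, while one driven by `acceptance_check` turns red as soon as customers would notice.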

> Seriously: I don't understand why you guys stay with AWS.

You don't seem to have enough experience to comment on the issue.

Please visit this comment sub-tree:

https://news.ycombinator.com/item?id=13765786

That is a regurgitation of your opinion without any facts.

Comparing technology and saying "it seems" or "i feel" isn't really a good argument to convince me one way or the other.

> Seriously: I don't understand why you guys stay with AWS.

I tried them all and Amazon is still the best.

Postgres on RDS
Come to NEXT in a week! :).
Any chance UDF iterators for Cloud Bigtable are in the works?

Being able to run distributed D4M/GraphBLAS queries in Cloud Bigtable would be killer.

"From NoSQL Accumulo to NewSQL Graphulo: Design and Utility of Graph Algorithms inside a BigTable Database" https://arxiv.org/pdf/1606.07085.pdf

I'm seeing green checkmarks across the board, but they just added a notice to the top of the page:

> Increased Error Rates

> We are investigating increased error rates for Amazon S3 requests in the US-EAST-1 Region.

I guess sub-1% to 100% failure rate is technically an "increase".
I guess file uploads and downloads are technically "API calls".
The worst thing is when your system can't handle these "increased error rates" and your control plane cascades into failure because of something like this....

The worst "increased error rate" problem I had was when the API was failing and my autoscale system couldn't cope: it launched thousands of instances because it couldn't tell when instances had launched (lack of API access), and the instances pummelled the fuck out of all other parts of the system and we basically had to reboot the entire platform....

Luckily, Amazon is REALLY forgiving with respect to costs in these (and actually most) circumstances....
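
One way to guard against that runaway, sketched under assumptions: the function name, the `pending` counter, and the per-cycle cap of 5 are all hypothetical, and this is not how any particular autoscaler works. The idea is to treat "the API could not tell me" as unknown rather than as zero, and to bound every cycle.

```python
from typing import Optional


def instances_to_launch(
    desired: int,
    observed: Optional[int],
    pending: int,
    max_per_cycle: int = 5,
) -> int:
    """Return how many instances to launch in this cycle.

    observed is None when the provider API could not be queried.
    pending counts launches already requested but not yet confirmed.
    """
    if observed is None:
        # Unknown state: do not launch blindly; wait for the API to recover.
        return 0
    shortfall = desired - observed - pending
    return max(0, min(shortfall, max_per_cycle))


print(instances_to_launch(desired=10, observed=None, pending=0))  # 0
print(instances_to_launch(desired=10, observed=4, pending=3))     # 3
print(instances_to_launch(desired=100, observed=0, pending=0))    # 5
```

Counting `pending` launches is what stops the loop from re-requesting capacity it has already asked for, and the cap limits the damage if the other two safeguards are wrong.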

recalls numerous times

Yes. Yes they are. Thankfully.

I always joke that if one of those statuses ever went to red, it means the zombie apocalypse has begun.
The number of non-green marks is the number of ICBMs currently in flight towards an AWS data center.
The good news is, if Amazon's services are marked as offline, you're allowed to use Amazon Lumberyard to control nuclear power plants.
In case anyone wants to see what the mysterious red icon looks like: https://status.aws.amazon.com/images/status3.gif

At best when there are problems (not like now I guess) I will see the "note" green icon https://status.aws.amazon.com/images/status1.gif

I've heard (on the Fnord news show at the most recent CCC congress, so take it with a grain of salt and a bucket of humor) that Amazon's TOS are more or less void when a Zombie Apocalypse breaks out.

They had some convoluted but fairly specific wording in their TOS; whoever wrote it must have had a lot of fun.

From https://aws.amazon.com/service-terms/

> 57.10 Acceptable Use; Safety-Critical Systems. Your use of the Lumberyard Materials must comply with the AWS Acceptable Use Policy. The Lumberyard Materials are not intended for use with life-critical or safety-critical systems, such as use in operation of medical equipment, automated transportation systems, autonomous vehicles, aircraft or air traffic control, nuclear facilities, manned spacecraft, or military use in connection with live combat. However, this restriction will not apply in the event of the occurrence (certified by the United States Centers for Disease Control or successor body) of a widespread viral infection transmitted via bites or contact with bodily fluids that causes human corpses to reanimate and seek to consume living human flesh, blood, brain or nerve tissue and is likely to result in the fall of organized civilization.

First the fall of human civilization has to be a real threat per the TOS so not sure they'll care.

Second, I know the lawyer and yes he had fun.

Then I guess it has begun, the page is now showing red. I'd put a picture on imgur but it's not loading.
http://downdetector.com/status/aws-amazon-web-services looks like a reasonable alternative place to check/report downtime.
I just check Twitter, since Amazon's status is always a lie. My personal dashboard is still showing no problems. It's bad enough that the main public status is always green even when there's clearly a problem, but you'd think they could at least make the private status accurate.
Which is coincidentally down.
Maybe they are hosted on S3 facepalm or maybe they just got a surge in traffic
downdetectordown.com ?
yep, page won't load.
Gah. It was up 3 minutes ago. Anyone have any suspicion this is another ddos episode? I saw that SO was down last night too: https://twitter.com/StackStatus/status/836450836322516992
Pretty confident that isn't it. S3 was returning InternalErrors for 22 seconds before it started timing out and/or returning 503s to all my requests.

I'd bet that something broke (causing InternalError responses) and then nodes started marking themselves as failed (causing the timeouts and 503s soon after).

I want to see the botnet capable of DDoSing S3. That would be something.
Apparently, that's down too. Sigh.
So, global S3 outage for more than an hour now. Still green, still talking about "US East issue". I'm amazed.
It doesn't appear to be global; my app in eu-west-1 appears unaffected.

It's possible that the console won't work however as I believe that's served from us-east-1.

My site hosted on S3 is also running.
Looks like they have fixed the issue with their health dashboard now.

From https://status.aws.amazon.com/ : Update at 11:35 AM PST: We have now repaired the ability to update the service health dashboard. The service updates are below. We continue to experience high error rates with S3 in US-EAST-1, which is impacting various AWS services. We are working hard at repairing S3, believe we understand root cause, and are working on implementing what we believe will remediate the issue.

There was an alert on the personal health dashboard[1] a second ago, it said S3 Operational issue in us-east-1 but when I tried to view the details it showed an error.

Then I refreshed and the event disappeared altogether.

[1] https://phd.aws.amazon.com/phd/home?region=us-east-1#/dashbo...

Same here. But it is in the general status dashboard: http://status.aws.amazon.com/
Still green now, 8 minutes in.
I've had a few non-Amazon providers tell me AWS things are not working in the last 5 minutes, no note from Amazon though.

Nice.

Just sent out a notice to our customers via our status page. I really wanted to be able to add a link back to AWS detailing the issue but that's a pipe dream I suppose.
... still green
Looks like his personal site isn't loading... :)
Yup, it is indeed hosted with Amazon.
We have a slack emoji for it called greenish. It's the classic AWS green checkmark with an info icon in the bottom. Apparently it's NOT an outage if you don't acknowledge it. It's called alt-uptime.
I really liked it. But when trying to add it to my HipChat group it failed to upload. Why? S3 outage, what an irony.
AWS internal lingo calls this the "green-i"
Just went yellow

Edit: nevermind

Did it? Still fields of green for me.
While keeping the status green for s3, they have at least put up a notice at the top:

Increased Error Rates

We are investigating increased error rates for Amazon S3 requests in the US-EAST-1 Region.

Yeah I just now saw that. Probably regional cache clearing or something.
Still green for me
Just went yellow

Increased Error Rates

We are investigating increased error rates for Amazon S3 requests in the US-EAST-1 Region.

https://status.aws.amazon.com/

Check individual services ...

Amazon Simple Storage Service (US Standard) Service is operating normally