| HN Mirror

https://twitter.com/awscloud/status/836656664635846656

ak2196 3400 days ago

The truth is stranger than fiction.

crowbahr 3399 days ago

"Care to share the code as an anti pattern?" brilliant.

AtheistOfFail 3400 days ago

Comment of the year.

Kiro 3400 days ago

Fact of the year.

emrekzd 3400 days ago

In December 2015 I received an e-mail with the following subject line from AWS, around 4 am:

"Amazon EC2 Instance scheduled for retirement"

When I checked the logs it was clear the hardware failed 30 mins before they scheduled it for retirement. EC2 and root device data was gone. The e-mail also said "you may have already lost data".

So I know that Amazon schedules servers for retirement after they have already failed; the green check doesn't surprise me.

smoodles 3400 days ago

So just as an FYI, the reason that probably happened to you is that the underlying host was failing. I am assuming they wanted to give you a window to deal with it but the host croaked before then. I've been dealing w/ AWS for a long long time and I've never seen a maintenance event go early unless the physical hardware actually died...

amaks 3399 days ago

That's what happens when a cloud provider doesn't support live migration for VMs.

That's completely ridiculous, get some fucking RAID Amazon.

I order drives off newegg directly to my DC and I'm yet to lose data with the cheapest drives available in RAID10.

prdonahue 3400 days ago

Yes, solving problems at your scale and AWS' are quite comparable.

LoSboccacc 3399 days ago

but I never lost data off a USB stick, how hard could it be!

aurelianito 3399 days ago

Really?!?! Several times USB sticks (and USB HDs) failed on me and other people I work with.

Not saying my scale is the same at all - but the fact they can't do something so simple that I can do it as a single individual is embarrassing at best.

Simple solutions to this do scale - Linode and DigitalOcean don't have such issues for example - and while they're not Amazon scale, they are quite large and I'd say they prove the concept.

phonon 3400 days ago

EBS data is backed up in multiple redundant ways (using erasure encoding I think).

Local storage is not intended for permanent storage, and is more "use at your own risk". That's also why most of the new EC2 instances don't even support local storage.

Availability =/= durability of course

EBS is incredibly expensive and slow, not really a good solution. It'd be nice if they offered a better local storage option.

foxylion 3399 days ago

It's not just a RAID that can fail. And everyone who uses AWS should expect failures. You should build your infrastructure to handle such failures well.

problems 3399 days ago

They offer no RAID on local storage and only the expensive, IO restricted EBS as an alternative.

vacri 3400 days ago

Yes, the only way a server can die is from non-raided disks.

Otherwise they should at least be providing customers their data back.

foxylion 3399 days ago

I think you misunderstood the local storage. It is not intended to permanently store data. It's a volatile storage like RAM.

https://status.heroku.com/incidents/1059

tuna-piano 3400 days ago

It's crazy how much better the communication (including updates and status pages) is of the companies that rely on AWS than AWS' communication itself.

tcsf 3400 days ago

Blake Gentry gave a full accounting of Heroku's response process here - http://www.heavybit.com/library/video/every-minute-counts-co...

Amazon should take notes.

all_usernames 3400 days ago

I feel for them. Imagine, 40 or 50 different engineering teams all responsible for updating their statuses. At this moment on the AWS status page I see random usage of the red, yellow, and green icons, even though all the status updates are "Increased error rates." What that tells me is that there's no unified communication protocol across the teams, or they're not following it. And just imagine what it's like being on the S3 team right now.

I notice even Cloudflare is starting to have problems serving up pages now.

https://status.fortawesome.com/

SnowingXIV 3399 days ago

Font Awesome went out for me for a bit, but they did a great job getting back up and keeping their users in the loop.

busterarm 3400 days ago

Also, https://status.pantheon.io

tlogan 3400 days ago

These service health boards are more like an advertisement page than the actual status of the service.

mwfj 3400 days ago

I guess their bizarre thinking is something along the lines of: "unless we have proof that no one can access the service, we won't change the indicator from green to yellow."

Seriously: I don't understand why you guys stay with AWS.

cagataygurturk 3400 days ago

Because you perceive public clouds only as virtual machine providers that you can replace with another provider in two days. A detailed cloud migration consists of replacing some parts of your software to use managed services provided by a specific cloud provider, and AWS still has the best service offerings IMHO. When you use these services carefully you will also see that AWS is very cheap and reliable enough. Outages like today's happen on every platform and it is possible to mitigate them.

You can use AdWords as a self-service user. Without knowing many of the details you can run your ads, but you can also very easily ruin your budget. But many enterprise customers use it very differently from those users and optimize the cost heavily. Cloud is the same. If you don't know how big customers use AWS, it is normal that you are surprised that AWS is still leading the market.

You say GCP is better than AWS. Which part is better? GCP lacks many of the AWS services we benefit from. How can you compare totally different providers? You can only say AWS EC2 is worse than GCP. But you cannot compare whole platforms in one sentence.

johansch 3399 days ago

(Sorry, I'm late to reply, but since you addended your comment you might still be listening...)

After spending a year evaluating both AWS and GCP (with an emphasis on their managed database services; both SQL and no-SQL) my general feeling is this:

"Microsoft Windows is to Unix as AWS is to GCP".

(Or perhaps closer to the truth: "VMS is to Unix as AWS is to GCP".)

Basically, AWS services seem like they are badly designed by bureaucratic, mediocre engineers following some bureaucratic template for "a service".

GCP feels a lot saner (both API- and UI/console-wise). I often got the feeling it's designed by people who:

a) are smart and well-rounded in terms of experiences. It does take cleverness and experience to design something elegant that is also useful.

b) take pride in their work (it does show)

(And then, as a bonus: It's cheaper!)

cagataygurturk 3399 days ago

You talk about SQL and No-SQL as managed services, which suggests that your experience is limited to a classical application consisting of virtual machines and some data storage. However, these are not the only services offered by both platforms, and currently AWS has a richer feature set. For example, Lambda and its deep integration with the whole AWS platform is the biggest game changer from my point of view. If we are talking about virtual machines and databases, I can accept this comparison. However, we are talking about 30+ services, some of which are not even available anywhere else, solving serious business problems in production and at scale. It is very wrong to put everything into one basket and compare. Maybe GCP has a better pub/sub service and AWS has better object storage. These should be compared separately. To answer your question of why we still stay with AWS: because it is solving our problems in the most cost-effective way and with reduced complexity, we are happy with it.

johansch 3399 days ago

You're probably assuming too much again :)

I specifically spent a lot of time on Lambda and found it quite annoying compared to GCP AppEngine. So much bureaucracy. Just this thing that you have to specifically register every single Lambda API call and its parameters using an interface built by non-thinking people.. Sheesh.

For on-demand processing I just want a single HTTP-ish entry point, like AppEngine provides. (That way I can move my service between different providers, if I wanted to move away from e.g. AWS.)

cagataygurturk 3399 days ago

Sorry for endless number of typos and mistakes. Obviously I was sleepy while I was writing this.

gtsteve 3400 days ago

> Seriously: I don't understand why you guys stay with AWS.

Personally I've been using it for ages and I know most services inside and out. They do suffer downtime in some regions occasionally, but it'd be too expensive at this point to move.

And who doesn't suffer downtime? You can't avoid it; you just need a plan to deal with it. For example, having a backup replica bucket in another region and the ability to quickly switch your CDN over would probably be a good idea here; that's what I did.

If you want to go further you can replicate your data to another cloud provider entirely and use low TTLs to switch to a backup CDN if your system is that mission-critical (in the event of a worldwide AWS failure doomsday scenario).

All systems will fail you and it's our responsibility as IT professionals to have a plan to mitigate this.
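A minimal sketch of the replica-bucket fallback described above, assuming each origin is just a callable that returns bytes or raises. The names `fetch_with_fallback`, `primary_bucket` and `replica_bucket` are hypothetical stand-ins, not any provider's API; a real setup would more likely switch at the CDN or DNS layer.

```python
from typing import Callable, Sequence


class AllOriginsFailed(Exception):
    """Raised when every configured origin fails to return the object."""


def fetch_with_fallback(key: str, origins: Sequence[Callable[[str], bytes]]) -> bytes:
    """Try each origin in order and return the first successful response."""
    errors = []
    for fetch in origins:
        try:
            return fetch(key)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors.append(exc)
    raise AllOriginsFailed(f"{len(errors)} origin(s) failed for {key!r}")


# Hypothetical stand-ins for a primary bucket and its cross-region replica.
def primary_bucket(key: str) -> bytes:
    raise ConnectionError("503 Service Unavailable")


def replica_bucket(key: str) -> bytes:
    return b"contents of " + key.encode()


# The primary fails, so the read is served from the replica.
data = fetch_with_fallback("logo.png", [primary_bucket, replica_bucket])
```

The ordering of `origins` is the whole policy here: the replica is only touched when the primary raises, so normal traffic is unaffected.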

nicolaslem 3399 days ago

Low TTLs on DNS entries might do more harm than good: if your DNS provider gets seriously DDoSed, being able to rely on caches can save the day.

Anyway, I agree with your conclusion.

Sunk cost fallacy.

I do agree that we should all plan for failures.

However, I also think it's a sign of failure in planning and architecture foresight if it's too expensive to move away from a particular cloud provider.

froogle 3400 days ago

The sunk cost fallacy is when you (irrationally) decide to stick with what you're doing purely because you've already spent a lot of resources on it. It doesn't apply when you've done an economic analysis and found out it doesn't make sense to swap.

There are plenty of cases where it just wouldn't make sense to switch after looking at the costs, opportunity costs, etc. For example, if his site makes him $10 a month, outages cost him $1 a month that could be mitigated by moving, and it would cost $1000 of labor to swap providers. (Depends on interest rates.)
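The arithmetic in that example can be sketched roughly as follows. The function names and the 5% annual rate are assumptions for illustration; only the $1 per month saving and the $1000 migration cost come from the comment.

```python
def payback_months(migration_cost: float, monthly_saving: float) -> float:
    """Months until undiscounted savings cover the one-time migration cost."""
    return migration_cost / monthly_saving


def perpetuity_value(monthly_saving: float, annual_rate: float) -> float:
    """Present value of receiving monthly_saving forever at a positive rate."""
    monthly_rate = annual_rate / 12  # simple monthly rate, an assumption
    return monthly_saving / monthly_rate


def npv_of_migrating(migration_cost: float, monthly_saving: float,
                     annual_rate: float, months: int) -> float:
    """Net present value of migrating, discounting each month's saving."""
    monthly_rate = annual_rate / 12
    present_value = sum(
        monthly_saving / (1 + monthly_rate) ** m for m in range(1, months + 1)
    )
    return present_value - migration_cost


# Numbers from the comment: $1/month saved, $1000 to migrate.
months_needed = payback_months(1000, 1)      # 1000 months with no discounting
forever_value = perpetuity_value(1, 0.05)    # about $240 at an assumed 5% rate
ten_year_npv = npv_of_migrating(1000, 1, 0.05, 120)  # negative: not worth it
```

At the assumed 5% rate the saving is worth about $240 even if it continues forever, which is less than the $1000 cost; that is one way to read the "depends on interest rates" remark.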

Perhaps it was originally a failure to not have a plan to easily move from a provider, but it doesn't seem unreasonable to me that right now it may cost too many hours of work to justify the move.

gtsteve 3399 days ago

It's not as though it would be impossible; our integration with AWS isn't that deep, it's not as though we use DynamoDB for our core data store or anything like that. But even migrating from one traditional datacenter to another isn't easy from an operational point of view.

There needs to be a clear financial win. Even taking into account the failures we've seen so far, I don't see a compelling reason to leave AWS.

(You're right, I used that term incorrectly.)

Still stand behind the other two points I made in that post though.

rattray 3400 days ago

> I don't understand why you guys stay with AWS.

Who do you recommend instead (assuming in-house or Hetzner-equiv is out of reach)? Google Cloud? Azure? Rackspace?

https://thehftguy.com/2016/06/15/gce-vs-aws-in-2016-why-you-...

Google Cloud if you're looking for something similar. It's just so much better and cheaper. I think a lot of the resistance here towards that kind of move is just because people are inherently lazy and they aren't paying the bill themselves.

(I'm guessing a relatively large part is also selfish attachment to the market leader because of employment reasons. I hate wasting money, both for myself and for my employer, so I don't really understand this kind of thinking - but I do understand how it could flourish in a venture capital-rich time/locale.)

I also recommend reading:

debaserab2 3400 days ago

Google Cloud doesn't exactly have the greatest reliability/uptime either.

eicnix 3400 days ago

https://status.cloud.google.com/summary tells a different story or do you have other information?

I have used GCP for some time without being affected from any incident.

joatmon-snoo 3400 days ago

Google also doesn't have the best record for developer tools.

chebum 3400 days ago

GC's CDN doesn't cache files bigger than 4Mb. No Windows VMs. Bound to AWS for these 2 reasons.

https://cloud.google.com/compute/docs/instances/windows/

dragonwriter 3400 days ago

GCE supports Windows VMs.

mrmaximus 3399 days ago

As already mentioned, they do have Windows VM's but there are some caveats that indicate it's not fully baked yet. 1.) They require that each VM MUST have a public IP address so that Windows can talk to an activation server every 30 days. 2.) You cannot yet bring your own license.

https://cloud.google.com/cdn/docs/caching

vgt 3400 days ago

Someone else already mentioned Windows VMs.

Looks like CDN has a 10MB limit:

(work at Google Cloud)

kohuma 3400 days ago

What about something like B2 from https://www.backblaze.com/ ?

Twirrim 3399 days ago

S3 in a single region is based out of multiple data centres / availability zones, with data distributed so that the loss of a single availability zone won't impact either data availability or durability, even to the point of being comfortable with complete physical destruction of an AZ. The same applies for Azure, GCP etc.

B2 is based out of a single DC (or at least, was at launch and I don't see anything that suggests that has changed?) You've got to decide what's most important to you. Data persistence or $$$.

pmalynin 3400 days ago

OVH

LeoHaggins 3400 days ago

Bad idea there, support is horrible.

rspeer 3400 days ago

OVH doesn't even want to take my money to keep my server running. Their auto-billing process is busted and when it goes wrong they just delete your server.

jacquesm 3400 days ago

What is your last datapoint on that?

The last year or two has seen a remarkable improvement according to those customers of mine that host there.

derefr 3399 days ago

I think it's more, "if the service can't do what people need it to do, that's a problem; if the service cluster gets wedged hard enough to stop responding to the requests of our monitoring system, that's a failure."

Which would make sense (and is sorta-kinda a best-practice) if Amazon wrote services such that they "crashed early", but instead they're seemingly written so the backend can lock up and be rendered completely useless at "doing its job" while continuing to run just fine.

Either of those two design decisions is potentially a good thing on its own, but they need to be considered in light of one-another if you want your status page to make any sense. If you want to report cluster failures, code your clusters to actually fail. If you want to keep your clusters up, write your monitoring checks as whole-stack acceptance tests.
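A rough sketch of the difference between a liveness ping and a whole-stack acceptance check, assuming a hypothetical store with `ping`, `put`, `get` and `delete` methods. `HealthyStore` and `WedgedStore` are in-memory stand-ins invented for illustration, not how any real status page works.

```python
import uuid


class HealthyStore:
    """Hypothetical in-memory stand-in for an object store."""

    def __init__(self):
        self._data = {}

    def ping(self):
        return True

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]

    def delete(self, key):
        del self._data[key]


class WedgedStore(HealthyStore):
    """Process is alive and answers pings, but reads fail."""

    def get(self, key):
        raise TimeoutError("backend locked up")


def liveness_check(store) -> str:
    """Only asks whether the service responds at all."""
    return "green" if store.ping() else "red"


def acceptance_check(store) -> str:
    """Does what a customer does: write, read back, delete."""
    key = f"status-probe/{uuid.uuid4()}"
    payload = b"probe"
    try:
        store.put(key, payload)
        ok = store.get(key) == payload
        store.delete(key)
    except Exception:
        return "red"
    return "green" if ok else "red"


wedged = WedgedStore()
ping_status = liveness_check(wedged)    # "green": the process still answers
real_status = acceptance_check(wedged)  # "red": it cannot do its job
```

The wedged store passes the ping and fails the round trip, which is the mismatch between dashboard and reality described above.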

notyourwork 3399 days ago

> Seriously: I don't understand why you guys stay with AWS.

You don't seem to have enough experience to comment on the issue.

https://news.ycombinator.com/item?id=13765786

johansch 3399 days ago

Please visit this comment sub-tree:

notyourwork 3397 days ago

That is a regurgitation of your opinion without any facts.

Comparing technology and saying "it seems" or "i feel" isn't really a good argument to convince me one way or the other.

tlogan 3400 days ago

> Seriously: I don't understand why you guys stay with AWS.

I tried them all and Amazon is still the best.

Postgres on RDS

Come to NEXT in a week! :).

espeed 3399 days ago

Any chance UDF iterators for Cloud Bigtable are in the works?

Being able to run distributed D4M/GraphBLAS queries in Cloud Bigtable would be killer.

"From NoSQL Accumulo to NewSQL Graphulo: Design and Utility of Graph Algorithms inside a BigTable Database" https://arxiv.org/pdf/1606.07085.pdf

hartleybrody 3400 days ago

I'm seeing green checkmarks across the board, but they just added a notice to the top of the page:

> Increased Error Rates

> We are investigating increased error rates for Amazon S3 requests in the US-EAST-1 Region.

syntheticcdo 3400 days ago

I guess sub-1% to 100% failure rate is technically an "increase".

Artemis2 3400 days ago

I guess file uploads and downloads are technically "API calls".

samstave 3400 days ago

the worst thing is when your system can't handle these "increased error rates" as your control plane cascades failure due to something like this....

The worst "increased error rate" problem I had was when the API was failing and my autoscale system couldn't deal and launched thousands of instances because it couldn't tell when instances were launched (lack of API access) and the instances pummelled the fuck out of all other parts of the system and we basically had to reboot the entire platform....

Luckily, amazon is REALLY forgiving with respect to costs in these (and actually most) circumstances....
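One way to guard against that kind of runaway, sketched under assumptions: the scaler tracks its own pending launches, holds when the instance count cannot be read, and never exceeds a hard cap. The function name and parameters are hypothetical and not taken from any autoscaling product.

```python
from typing import Optional


def instances_to_launch(desired: int, observed: Optional[int],
                        pending: int, hard_cap: int) -> int:
    """Return how many instances to launch this cycle.

    observed is None when the provider API could not be queried.
    pending counts launches already requested but not yet confirmed.
    """
    if observed is None:
        # Fail closed: without a count we cannot tell whether earlier
        # launches succeeded, so hold instead of launching blind.
        return 0
    known = observed + pending
    wanted = desired - known
    headroom = hard_cap - known
    return max(0, min(wanted, headroom))


# API outage: no launches, regardless of how many are desired.
during_outage = instances_to_launch(desired=10, observed=None, pending=0, hard_cap=20)
# Normal cycle: 4 running, 2 pending, 10 desired, so launch 4 more.
normal_cycle = instances_to_launch(desired=10, observed=4, pending=2, hard_cap=20)
```

Holding during an outage trades slower scale-up for a bounded bill; whether that trade is right depends on the workload.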

salvor 3399 days ago

recalls numerous times

Yes. Yes they are. Thankfully.

matwood 3400 days ago

I always joke that if one of those statuses ever went to red, it means the zombie apocalypse has begun.

paulddraper 3400 days ago

The number of non-green marks is the number of ICBMs currently in flight towards an AWS data center.

cperciva 3400 days ago

The good news is, if Amazon's services are marked as offline, you're allowed to use Amazon Lumberyard to control nuclear power plants.

chrisan 3400 days ago

In case anyone wants to see what the mysterious red icon looks like: https://status.aws.amazon.com/images/status3.gif

At best when there are problems (not like now I guess) I will see the "note" green icon https://status.aws.amazon.com/images/status1.gif

krylon 3400 days ago

I've heard (on the Fnord News Show at the most recent CCC congress, so take it with a grain of salt and a bucket of humor) that Amazon's TOS are more or less void when a Zombie Apocalypse breaks out.

They had some convoluted but fairly specific wording in their TOS; whoever wrote it must have had a lot of fun.

jfim 3399 days ago

From https://aws.amazon.com/service-terms/

> 57.10 Acceptable Use; Safety-Critical Systems. Your use of the Lumberyard Materials must comply with the AWS Acceptable Use Policy. The Lumberyard Materials are not intended for use with life-critical or safety-critical systems, such as use in operation of medical equipment, automated transportation systems, autonomous vehicles, aircraft or air traffic control, nuclear facilities, manned spacecraft, or military use in connection with live combat. However, this restriction will not apply in the event of the occurrence (certified by the United States Centers for Disease Control or successor body) of a widespread viral infection transmitted via bites or contact with bodily fluids that causes human corpses to reanimate and seek to consume living human flesh, blood, brain or nerve tissue and is likely to result in the fall of organized civilization.

grogenaut 3399 days ago

First the fall of human civilization has to be a real threat per the TOS so not sure they'll care.

Second, I know the lawyer and yes he had fun.

obsurveyor 3400 days ago

Then I guess it has begun, the page is now showing red. I'd put a picture on imgur but it's not loading.

jonstaab 3400 days ago

http://downdetector.com/status/aws-amazon-web-services looks like a reasonable alternative place to check/report downtime.

zedpm 3400 days ago

I just check Twitter, since Amazon's status is always a lie. My personal dashboard is still showing no problems. It's bad enough that the main public status is always green even when there's clearly a problem, but you'd think they could at least make the private status accurate.

eicnix 3400 days ago

Which is coincidentally down.

talawahdotnet 3400 days ago

Maybe they are hosted on S3 facepalm or maybe they just got a surge in traffic

booleanbetrayal 3400 days ago

downdetectordown.com ?

pure_ambition 3400 days ago

yep, page won't load.

jonstaab 3400 days ago

Gah. It was up 3 minutes ago. Anyone have any suspicion this is another ddos episode? I saw that SO was down last night too: https://twitter.com/StackStatus/status/836450836322516992

cperciva 3400 days ago

Pretty confident that isn't it. S3 was returning InternalErrors for 22 seconds before it started timing out and/or returning 503s to all my requests.

I'd bet that something broke (causing InternalError responses) and then nodes started marking themselves as failed (causing the timeouts and 503s soon after).

chx 3400 days ago

I want to see the botnet capable of DDoSing S3. That would be something.

vjdhama 3400 days ago

Apparently, that's down too. Sigh.

Fiahil 3400 days ago

So, global S3 outage for more than an hour now. Still green, still talking about "US East issue". I'm amazed.

gtsteve 3400 days ago

It doesn't appear to be global; my app in eu-west-1 appears unaffected.

It's possible that the console won't work however as I believe that's served from us-east-1.

chebum 3400 days ago

My site hosted on S3 is also running.

gordon_freeman 3399 days ago

Looks like they have fixed the issue with their health dashboard now.

From https://status.aws.amazon.com/ : Update at 11:35 AM PST: We have now repaired the ability to update the service health dashboard. The service updates are below. We continue to experience high error rates with S3 in US-EAST-1, which is impacting various AWS services. We are working hard at repairing S3, believe we understand root cause, and are working on implementing what we believe will remediate the issue.

[1] https://phd.aws.amazon.com/phd/home?region=us-east-1#/dashbo...

talawahdotnet 3400 days ago

There was an alert on the personal health dashboard[1] a second ago, it said S3 Operational issue in us-east-1 but when I tried to view the details it showed an error.

Then I refreshed and the event disappeared altogether.

socialentp 3400 days ago

Same here. But it is in the general status dashboard: http://status.aws.amazon.com/

tuna-piano 3400 days ago

Still green now, 8 minutes in.

bpicolo 3400 days ago

I've had a few non-Amazon providers tell me AWS things are not working in the last 5 minutes, no note from Amazon though.

Nice.

leesalminen 3400 days ago

Just sent out a notice to our customers via our status page. I really wanted to be able to add a link back to AWS detailing the issue but that's a pipe dream I suppose.

fudged71 3400 days ago

... still green

https://news.ycombinator.com/user?id=jeffbarr

clamprecht 3400 days ago

Calling @jeffbarr

adrenalinelol 3400 days ago

Looks like his personal site isn't loading... :)

Yup, it is indeed hosted with Amazon.

ceejayoz 3400 days ago

Me right now: https://www.youtube.com/watch?v=_cHa063Mwos

ak2196 3400 days ago

We have a slack emoji for it called greenish. It's the classic AWS green checkmark with an info icon in the bottom. Apparently it's NOT an outage if you don't acknowledge it. It's called alt-uptime.

foxylion 3399 days ago

I really liked it. But when trying to add it to my HipChat group it failed to upload. Why? S3 outage, what an irony.

nhumrich 3399 days ago

AWS internal lingo calls this the "green-i"

cheeze 3400 days ago

Just went yellow

Edit: nevermind