by hedgehog_irl 1292 days ago
Everyone seems to overlook the point here: yet again Amazon were slow as hell to be honest with their customers. I get it, up/down reports help, but why do you keep using a service which lies to you about availability? I've read on HN in the past how the dashboard can only be updated to reflect an issue with approval (comments section on a similar posting, believe it if you wish). So why not move to a hosting company that is transparent and open about their status? I'll not make suggestions as I don't want to be accused of trying to shill for a specific provider, but there are plenty out there. 45 min to update their public dash is too slow. They either don't care, don't monitor, or they are trying to hide their stats for fear Jeff will beat the staff for SLA violations. If any other provider lied to customers the way AWS does, they wouldn't be tolerated. Why do you tolerate this behaviour from AWS?

Edited to fix auto correct issues

3 comments

Good luck convincing your company "Hey because they were 45 minutes late in informing us we need to move all our cloud to a different provider."

Updating a dashboard can easily be an automated process but for business reasons it is not. AWS did not "lie" about the incident - they are extremely transparent for all outages and disruptions (btw this was a disruption - not an outage). They stated on the issue the exact time frame for when the issue started and when it ended.

Is it bad they were late? Definitely. AWS has a history of being late due to the sheer scale it works at. I've caused an outage myself when I used to work there and updating the dashboard requires several higher ups to understand what exactly the issue is and what is considered to be worthy of "informing of an incident." These processes take time. Is it perfect? Absolutely not. But there are legitimate reasons for it.

I'm not sure why you think Jeff is involved here. This kind of disruption isn't enough to warrant someone as high as Jeff to be involved.

As for SLA violations, AWS publishes SLAs for every service and credits your account if availability ever dips below those defined thresholds. And as for caring, I don't know a single cloud provider with the level of great customer support AWS has. This is extremely opinionated but this is what I've observed in the industry.

I would recommend people to use AWS monitoring. But having some of your own basic internal dashboards / metrics is also worth having.

> I would recommend people to use AWS monitoring

Use a monitoring service to monitor the provider of the monitoring service? Wouldn't it be better to use a monitoring service hosted on a totally different provider?

I'm not even sure running your own monitoring is sufficient in this case. Sure it's useful to have, but when something goes wrong, the first thing I want to know is if it's us or them. If it's us, I/the team scramble to fix it in a panicked frenzy. If it's them (the cloud provider), and they acknowledge it early, even a simple "we're investigating an issue with X", we can at least take some comfort from the fact that it's out of our hands.

If we just don't know the cause, we assume it's us and jump into panicked frenzy mode. Panicked frenzy days are the worst days of my life, especially if it's discovered that it was all in vain.
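For what it's worth, the "is it us or them" decision can be written down as a tiny triage rule. This is only a rough sketch under my own assumptions: the three inputs are hypothetical signals you would collect from a monitor hosted on a different provider, and the "provider canary" (a trivial resource on the same cloud that bypasses your own code) is an idea I'm adding for illustration, not something anyone above described.

```python
def triage(our_app_ok, provider_canary_ok, provider_acknowledged):
    """Return a rough guess at where the fault lies.

    All three inputs are hypothetical signals, for illustration only:
    our_app_ok: an external probe against our own service succeeded.
    provider_canary_ok: an external probe against a trivial resource on
        the same provider, one that bypasses our code, succeeded.
    provider_acknowledged: the provider's status page reports an issue.
    """
    if our_app_ok:
        return "healthy"
    # Our service is failing. Blame the provider if they admit to an issue
    # or if something that does not depend on our code is failing too.
    if provider_acknowledged or not provider_canary_ok:
        return "probably them"
    # Otherwise assume the worst and start the panicked frenzy.
    return "probably us"


print(triage(our_app_ok=False, provider_canary_ok=True, provider_acknowledged=False))
```

Note the last branch: with no acknowledgement and a healthy canary, the rule falls back to "probably us", which is exactly the case where a late status update costs you a wasted frenzy.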

I understand the frustration, but I'm not convinced monitoring at large scale is that straightforward.

The core question is: what constitutes degraded service? Would you say a service is experiencing downtime every time a 500 response is served? If you're serving millions to billions of requests/sec it seems a bit disproportionate to mark a service down after a single 500 error, so then you need to work out some kind of acceptable threshold.

What about latency? Again you're just going to draw a line in the sand somewhere.

You end up with this big mix of metrics that define service quality, so you then have a kind of meta problem of deciding which metrics you should alert users on. Get too trigger happy and it's going to cost you money and customer trust, and your customers are going to get alert fatigue when it turns out the issue you alerted them about was more of a false alarm. Set the bar too high and you'll have angry customers wondering wtf is going on.

All that to say I don't think there's a right answer.
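To make the threshold idea concrete, here is a minimal sketch of an error-rate check over the most recent requests. The class name, the window size and the 1% line are all made-up values for illustration, not anything AWS or anyone else is known to use.

```python
from collections import deque


class ErrorRateWindow:
    """Track the most recent N requests and flag degradation by error rate."""

    def __init__(self, window_size=1000, threshold=0.01):
        # window_size and threshold are arbitrary illustrative values.
        self.threshold = threshold
        self.results = deque(maxlen=window_size)

    def record(self, status_code):
        # Count any 5xx response as a server-side error.
        self.results.append(500 <= status_code <= 599)

    def error_rate(self):
        if not self.results:
            return 0.0
        return sum(self.results) / len(self.results)

    def is_degraded(self):
        return self.error_rate() > self.threshold


window = ErrorRateWindow(window_size=1000, threshold=0.01)
for _ in range(999):
    window.record(200)
window.record(500)
# A single 500 in 1000 requests stays under the 1% line.
print(window.is_degraded())
```

Wherever you put the threshold, it is still a line in the sand: move it down and you get false alarms, move it up and you get angry customers.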

We were pretty liberal with posting to our status page for years and thought it was The Right Thing to do. I still do, to a point.

But, what ended up happening was a competitor who didn't have a status page at all would use our status page against us in the sales process. They just never mentioned their lack of a status page to compare to.

This was the same competitor who went 100% down for ~4 days during the busiest month of the year and only posted updates to a private Facebook group. There was data loss that was never publicly admitted to.

So, yeah, we implemented reasonable boundaries on what constitutes a post to the status page. We also adopted a new status page provider that let us get more granular with categorizing posts, and allowing users to subscribe to only "urgent" channels that pertain to them.

Before 2003-ish Amazon used to have a static "gonefishing" page on www.amazon.com that was manually triggered during outages. Due to newspaper reporters writing scripts that would detect the GF pages they were removed and the site was allowed to just spew 500s for whatever segment of critical pages was busted.
Very fair, but 45 min of an outage/disruption before manually updating the public status is poor service. Why is that acceptable for AWS to deliver to users?
AWS is the 800 pound gorilla in the cloud space. Are any of the other cloud providers better with customer honesty?
Also good luck trying to convince your company to migrate to another cloud provider over, say, implementing multi-region strategy, which you should have been doing in the first place.
Highlight the lack of transparency on reporting outages and that's a start. If your MPLS or ISP provider operated in the same way, the company wouldn't accept it.
My company is not going to spend hundreds of thousands of dollars or more, and months or even years of effort, and add additional constraints to the given pool of candidates we are hiring for, to migrate to GCP or Azure or DigitalOcean or Hetzner or wherever is considered more trendy than AWS right now due to "a lack of transparency" lmao. I would look completely incompetent to even suggest the idea to anyone internally.
But your company is willing to accept poor service and, as a result, spend more money with the same provider to ensure continuity. So essentially you reward AWS for hiding their stats, as they can claim high uptime figures, and when an outage happens it's the user's fault for not spending enough money with them to have many, many instances across the availability zones to ensure you're covered for the AWS mess-up. I get it, redundancy is needed in systems, but the lack of proper reporting means users are forced to overspend out of fear. It's a great business model: hook the clients in with lies and then get them to reward you for hiding facts. Clearly your company has money to burn, wasting it like this. Everyone knows they lie and are blatant about it, so why is it tolerated? As I said, I don't see other enterprise providers getting away with this kind of behaviour towards clients.
If you are willing to host your critical infra on some dodgy startup alternative that might go away in 3 months because you refuse to bend on your personal values and separate them from what the typical organization actually cares about, best of luck. I know HN tends to love the underdog, but there is a time and place for that, and a time and place to accept what you need to do to keep your services online.
Your company is hiring and retaining people who can't work with tooling outside Amazon Web Services?
Many companies are hiring and retaining specialists in AWS-lock-in-technology, who lack experience with another-cloud-provider-technology, so I don't know what's surprising.
Training and getting up to speed takes time and money, neither of which are unlimited for any organization. It's not that they/we can't work with other cloud services, it's that it would likely add up to months of additional on-boarding time to get someone who wasn't familiar with another cloud provider productive with infra at scale on said provider.
Didn't say you have to go "cloud". Rent hardware in a DC and run that yourself, or use a VPS. I mean, the cloud is just "other people's hardware", and I'd thank you to not insult gorillas like that by comparing them to Amazon.
This just brings a whole load of new expenses for staffing physical locations and creates more problems than it solves.
Colocation or especially server rental generally requires no persistent staffing. The datacenter has their own staff for tasks requiring physical intervention, and you have IPMI/iLO access to your servers for doing reboots and similar.
Not really. Renting from a DC provider means you just run the host yourself; they deal with power, space, cooling, etc.
I'd ask if there are any cloud providers worse with customer honesty instead.
I'm sure there are two-bit VPS providers that claim to be cloud and are terrible. But for the price and the claims of service from the likes of AWS, I dunno; they are at the scale where they don't have to care about customers.