by seuros 315 days ago
The AWS employee actually contacted me before my blog post even reached three digits in views. So no, it wasn’t PR-driven in the reactive sense.

But here’s what I learned from this experience: If you are stuck in a room full of deaf people, stop screaming, just open the door and go find someone who can hear you.

The 20 days of pain I went through weren't because AWS couldn't fix it.

It’s because I believed that one of the 9 support agents would eventually break script and act like a human. Or that another team was monitoring them.

Turns out, that never happened.

It took someone from outside the ticketing system to actually listen and say: Wait. This makes no sense.

3 comments

>So no, it wasn’t PR-driven in the reactive sense.

At my small business, we proactively monitor blogs and forums for mentions of our company name so that we can head off problems before they become big. I'm extremely confident that is what happened here.

It was PR-driven in the proactive sense. Which is still PR-driven. (which, by the way, I have no problem with! the problem is the shitty support when it isn't PR-driven)

Regardless, I 100% feel your pain with dealing with support agents that won't break script, and I am legitimately happy both that you got to reach someone high enough up the ladder to act human and that they were able to restore your data.

Thank you for your concern, and I appreciate the nuance in your take.

Yes, it is totally possible that AWS monitors blogs and forums for early damage control, like your company does.

But we shouldn’t paint it like I was bailed out by some algorithmic PR radar and nothing else.

Let’s not fall into the “Fuk the police” style of thinking where every action is assumed to be manipulation. Tarus didn’t reach out like a Scientology agent demanding I take the post down or warning me of consequences.

He came with empathy, internal leverage, and actually made things move.

Before I read Tarus's email, I wrote in Slack to Nate Berkopec (Puma maintainer): `Hi. AWS destroyed me, i'm going to take a big break .`

Then his email reset my cortisol levels to an acceptable level.

Most importantly, this incident triggered a CoE (Correction of Error) process inside AWS.

That means internal systems and defaults are being reviewed, and that’s more than I expected. We’re getting a real update that will affect cases like mine in the future.

So yeah, it may have started in the visibility layer, but what matters is that someone human got involved, and actual change is now happening.

>But we shouldn’t paint it like I was bailed out by some algorithmic PR radar and nothing else.

>[...] assumed to be manipulation

I think you're reading way more negativity into "PR" than I'm intending (which is no negativity).

It's very clear Tarus is a caring person who really did empathize with your situation and did his best to rectify it. It's not a bad thing that your issue may (most likely) have been brought to his attention because of "PR radar" or whatever.

The bad part, at Amazon and similar companies, is how they typically respond when a potential PR hit isn't on the line. Which, as I'm sure you know because you experienced it prior to posting your blog, is often a brick wall.

The overwhelming issue is that you often require some sort of threat of damage to their PR to be assisted. That doesn't make the PR itself a bad thing. And that fact implies nothing about the individuals like Tarus who care. Often the lowly tier 1 support empathizes, they just aren't allowed to do anything or say anything.

>It took someone from outside the ticketing system to actually listen and say: Wait. This makes no sense.

Which only happened because of your blog post. In other words, the effort to prevent bad PR led to them fixing your problem immediately, while 20 days of doing things the "right" way yielded absolutely no results.

This actually makes the problem you've described even worse: it indicates that AWS has absolutely no qualms about failing to properly support the majority of its customers.

The proper thing for them to do was not to have a human "outside the system" fix your problem. It was for them to fix the system so that the system could have fixed your problem.

That being said: Azure is so much worse than AWS. Even bad PR won't push them to fix things.

AWS Support absolutely fumbled the incident, but what you, and the majority of others commenting here, should have learned from the experience is this: running a business-critical workload in one AWS account is a self-inflicted single point of failure. Using separate accounts for prod/dev/test (and a break-glass account) is one of the top security best practices:

“SEC01-BP01 Separate workloads using accounts.” - https://docs.aws.amazon.com/wellarchitected/latest/security-...

Keep resources out of the payer/management accounts. Consolidated billing is fine, but the management account should stay empty.

"Best practices for the management account" - https://docs.aws.amazon.com/organizations/latest/userguide/o...

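For what it's worth, here is a rough sketch of a guard for the "management account should stay empty" rule: a deploy script can refuse to run when its credentials belong to the organization's management account. It assumes boto3-style clients are passed in (`sts_client` and `org_client` would come from `boto3.client("sts")` and `boto3.client("organizations")`); the function name `assert_not_management_account` is made up for illustration, and this is not an official AWS check.

```python
from typing import Any


def assert_not_management_account(sts_client: Any, org_client: Any) -> str:
    """Return the caller's account ID, or raise if it is the management account.

    sts_client and org_client are assumed to behave like boto3's "sts" and
    "organizations" clients. If the account is not part of an organization,
    describe_organization raises an error in boto3; this sketch lets that
    error propagate rather than guessing what the caller wants.
    """
    # Account that the current credentials belong to.
    current = sts_client.get_caller_identity()["Account"]
    # The Organizations API reports the management account as "MasterAccountId".
    org = org_client.describe_organization()["Organization"]
    management = org["MasterAccountId"]
    if current == management:
        raise RuntimeError(
            f"Refusing to deploy: {current} is the management (payer) account"
        )
    return current
```

Calling this at the top of a deployment script is one cheap way to keep workloads from drifting into the payer account; it does nothing about resources that are already there.
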
Enable cross-account backups. Copy snapshots or AWS Backup vaults to a second account so Support lockouts don’t equal data loss.

"Creating backup copies across AWS accounts" - https://docs.aws.amazon.com/aws-backup/latest/devguide/creat...

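A minimal sketch of the cross-account copy, assuming AWS Backup and a boto3 `backup` client passed in as `backup_client`. The helper names (`destination_vault_arn`, `copy_to_second_account`) and all argument values are hypothetical. Cross-account copy also has prerequisites on the AWS side (the linked guide covers them, e.g. the destination vault's access policy), which this code does not set up.

```python
from typing import Any


def destination_vault_arn(region: str, account_id: str, vault_name: str) -> str:
    """Build the ARN of a backup vault that lives in another account."""
    if len(account_id) != 12 or not account_id.isdigit():
        raise ValueError("AWS account IDs are 12 digits")
    return f"arn:aws:backup:{region}:{account_id}:backup-vault:{vault_name}"


def copy_to_second_account(
    backup_client: Any,
    *,
    recovery_point_arn: str,
    source_vault_name: str,
    source_account_id: str,
    dest_account_id: str,
    dest_region: str,
    dest_vault_name: str,
    iam_role_arn: str,
) -> str:
    """Start an AWS Backup copy job into a vault owned by a different account."""
    # The whole point is surviving a lockout of the source account,
    # so a same-account destination is treated as a mistake.
    if dest_account_id == source_account_id:
        raise ValueError("destination must be a different account")
    response = backup_client.start_copy_job(
        RecoveryPointArn=recovery_point_arn,
        SourceBackupVaultName=source_vault_name,
        DestinationBackupVaultArn=destination_vault_arn(
            dest_region, dest_account_id, dest_vault_name
        ),
        IamRoleArn=iam_role_arn,
    )
    return response["CopyJobId"]
```

In practice you would more likely put the copy rule into a backup plan so it runs on every backup; the one-off call above is just the shortest way to show which pieces (source vault, destination vault ARN, IAM role) are involved.
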
Populate Billing, Security, and Ops alternate contacts. AWS Support escalates to those addresses when the primary inbox is dead.

"Update the alternate contacts for your AWS account" - https://docs.aws.amazon.com/accounts/latest/reference/manage...

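If you would rather script the alternate contacts than click through the console, something like this should work, assuming a boto3 `account` client is passed in as `account_client`. The `contacts` dict layout and the name `set_alternate_contacts` are my own invention for the example; only `put_alternate_contact` and its parameters come from the API.

```python
from typing import Any, Dict

# The three alternate contact types the Account Management API accepts.
CONTACT_TYPES = ("BILLING", "OPERATIONS", "SECURITY")


def set_alternate_contacts(
    account_client: Any, contacts: Dict[str, Dict[str, str]]
) -> None:
    """Set all three alternate contacts, refusing to leave any of them empty.

    contacts maps each contact type to a dict with the keys
    "email", "name", "phone" and "title" (a hypothetical layout).
    """
    missing = [t for t in CONTACT_TYPES if t not in contacts]
    if missing:
        raise ValueError(f"missing alternate contacts: {', '.join(missing)}")
    for contact_type in CONTACT_TYPES:
        contact = contacts[contact_type]
        account_client.put_alternate_contact(
            AlternateContactType=contact_type,
            EmailAddress=contact["email"],
            Name=contact["name"],
            PhoneNumber=contact["phone"],
            Title=contact["title"],
        )
```

Pointing these at shared mailboxes rather than one person's address avoids recreating the dead-inbox problem one level up.
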
Follow the multi-account white-paper for long-term org design. It is not optional reading.

"Organizing Your AWS Environment Using Multiple Accounts" - https://docs.aws.amazon.com/whitepapers/latest/organizing-yo...

Maybe get some training?

https://aws.amazon.com/training/classroom/