
by kkielhofner 645 days ago

> Also, anyone in this industry long enough has been around for "Oh, we will just replace that broken piece of hardware" that ended up "WHY IS EVERYTHING ON FIRE?" because versions didn't match up, hardware was rejected

I've been doing this for 25 years and I'm not sure what this means. Dell isn't going to come back to you and say "sorry but we can't fix this". With the warranty SLA, worst case scenario they'll just replace the entire machine if they have to, although I don't remember ever seeing it come to that.

> just plain "Actually, THAT failure mode isn't redundant."

When it comes down to it, similar issues exist with clouds - regions, availability zones, etc. Big clouds have had multiple widespread outages just this year[0].

From that reference you can see that MS and Amazon themselves struggle to design, build, and run solutions for their own products in their own clouds.

It's always interesting to see marquee household name companies/products/solutions go down when US-East (or whatever) is having a bad day again.

Cloud can be a lot of things but a silver bullet for reliability and uptime isn't one of them.

[0] - https://www.forbes.com/sites/emilsayegh/2024/07/31/microsoft...


stackskipton 645 days ago

>I've been doing this for 25 years and I'm not sure what this means. Dell isn't going to come back to you and say "sorry but we can't fix this".

Dell/EMC says "Hey, here is a drive replacement." We do it, and 2 hours later the volume is knocked offline. Apparently, there was a mismatch between the backplane version and the drive version, and through some weird edge case it knocked the volume offline. Yes, they fixed it; no, it wasn't pretty, since a bunch of applications had to be recovered.

No, public clouds are not 100% reliable either. It's just that their failures tend to be you twiddling your thumbs vs hair on fire on the phone with the vendor trying to get it resolved.

kkielhofner 645 days ago

> Dell/EMC says "Hey, here is a drive replacement." We do it, and 2 hours later the volume is knocked offline. Apparently, there was a mismatch between the backplane version and the drive version, and through some weird edge case it knocked the volume offline. Yes, they fixed it; no, it wasn't pretty, since a bunch of applications had to be recovered.

Anecdotal (as is my position). I can theoretically understand this happening, but not only have I never seen it, such an issue would need to be escalated. That's a "this is unacceptable" high-level phone call, and one where you more than likely have a chance of someone in actual authority answering. IME, unless you have SERIOUS spend with big cloud, you'll be lucky to make it a rung or two up sales/support.

Plus, backups and redundancies should prevent even the failure of a chassis/storage/etc. from being a critical issue.

> their failures tend to be you twiddling your thumbs vs hair on fire on the phone with the vendor trying to get it resolved

As a Founder/CTO I have the opposite take - put me and my team in a position to /do something/ vs sitting around waiting for AWS to come back whenever it decides to, while they obscure comms, don't update the fake status dashboards, etc. Meanwhile you're telling your customer "Ummm, we don't know - Amazon has a problem. When it comes back I guess it's back".

Coming from a background of telecom, healthcare, and nuclear energy, I can't believe that even flies.