by cagenut 5125 days ago
Hey guys, I got this; I speak cloudonaut. Here, I'll translate it to sysadmin:

An admin was doing a rolling restart that triggered a bug in the load balancer software. The auto-restart script turned out to just make things worse by restarting it over and over (they always do), so we thought we'd just quickly throw spare capacity at it, but it turns out that never works in a panicked rush either. Also, our system designed to handle outage notifications wasn't capacity planned, like, at all.
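
For what it's worth, the usual way to keep a restart script from looping forever is to cap the number of attempts and back off between them. A minimal sketch in Python, not anything from the actual outage: `restart_with_backoff`, the `start` callable, and every default value here are made up for illustration.

```python
import time
from typing import Callable


def restart_with_backoff(
    start: Callable[[], bool],
    max_restarts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to start a service, giving up after max_restarts failed attempts.

    `start` is a hypothetical callable that returns True once the service
    is up and healthy. Returns True on success, False if we gave up.
    """
    delay = base_delay
    for attempt in range(1, max_restarts + 1):
        if start():
            return True
        if attempt < max_restarts:
            sleep(delay)                       # wait before trying again
            delay = min(delay * 2, max_delay)  # exponential backoff, capped
    return False  # give up and page a human instead of looping forever
```

The `sleep` parameter is only there so the delays can be swapped out in a test; in real use you'd leave it as `time.sleep` and alert somebody when the function returns False.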

1 comment

I know this is a joke, but from the sound of the errors, that isn't far from the truth. This is also basically what happened with the big two-day outage at Amazon a while back. It's always the automated processes that come back to bite you, it seems.

I know I have had my share of server issues, but it seems to me that many 'cloud' services out there are simply adding too many layers of abstraction, which makes things very sensitive to any small issue. Because of this, I try to keep my server stacks/frameworks as basic as possible while still implementing performance-oriented services like NoSQL, caching, etc.

Although I have my fair share of hesitance about worshiping cloud services, the fact that a service is "cloud" has nothing to do with the quality of its architecture. You can make a crucial architecture mistake designing a fleet of dedicated PHP servers talking to a MySQL cluster just as easily as you can building atop some cloud service.