Hacker News
by xaranke 2857 days ago
Was there a way to fix the services so that you wouldn't get woken up in the middle of the night?
1 comment

Not entirely. Some issues were fixable, like moving our RabbitMQ cluster off RHEL and onto AWS. But others weren't. An upstream service we depended on went down and caused a cascading failure. It was the company's core product, a massive Java program running on bare metal that frequently OOM-killed our service, and even though it was the big money-maker, no team owned it and nobody understood how it worked. I don't remember why our service had to share a host with this monster, but there was a good reason and it just couldn't be worked around.