| HN Mirror



	by swrobel 3787 days ago
	Anyone got a good tl;dr version?

6 comments

alblue 3787 days ago

Power outage in DC brought many machines down. Redis clusters failed to start owing to disk issues (not cleanly unmounted?). The reboot of remaining machines uncovered an unknown dependency on the machines needing the redis cluster to be up in order to boot.

There were other learning points, such as immediately going into anti-DDoS mode, and human communication issues that meant the problem was not recognised or escalated until some time after it started.


aidenn0 3787 days ago

Power outage brought 25% of servers down.

Firmware issue meant that a large fraction of their servers could not detect the disks on reboot.

This prevented the redis cluster from starting.

They inadvertently had a hard dependency on redis being up for the majority of their infrastructure to start.


daigoba66 3787 days ago

Lost power. Took a while to get the servers cleanly rebooted.


contingencies 3786 days ago

No CI/test process was in place for critical systems to ensure that they had no external dependencies.

Takeaway: If you run any complex system, ensure that each component is tested for its response to various degrees of failure in peer services, including but not limited to totally unavailable, intermittent connectivity, reduced bandwidth, lossy links, power-cycling peers.
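As a rough illustration of that kind of test, here is a minimal Python sketch. Everything in it is hypothetical: `FakePeer`, `Service`, `PeerUnavailable` and the "degraded" fallback are names made up for the example, not anything from the postmortem. It only covers three of the failure modes listed above (healthy, intermittent, totally unavailable); reduced bandwidth and lossy links would need a real network fault injector.

```python
class PeerUnavailable(Exception):
    """Raised by the fake peer when it simulates a failure."""


class FakePeer:
    """Hypothetical stand-in for a peer service (e.g. a cache cluster)."""

    def __init__(self, mode="healthy"):
        self.mode = mode   # "healthy", "flaky" or "down"
        self.calls = 0

    def get(self, key):
        self.calls += 1
        if self.mode == "down":
            raise PeerUnavailable("peer is totally unavailable")
        # In flaky mode every odd-numbered call fails.
        if self.mode == "flaky" and self.calls % 2 == 1:
            raise PeerUnavailable("intermittent failure")
        return {"source": "peer", "key": key}


class Service:
    """Hypothetical component that must boot even if its peer is gone."""

    def __init__(self, peer, retries=3):
        self.peer = peer
        self.retries = retries
        self.state = "stopped"
        self.config = None

    def start(self):
        # Try the peer a bounded number of times, then fall back to
        # local defaults instead of refusing to start.
        for _ in range(self.retries):
            try:
                self.config = self.peer.get("config")
                self.state = "running"
                return self.state
            except PeerUnavailable:
                continue
        self.config = {"source": "local-defaults"}
        self.state = "degraded"
        return self.state


def boot_states():
    """Boot the service once against each simulated peer failure mode."""
    modes = ("healthy", "flaky", "down")
    return {mode: Service(FakePeer(mode)).start() for mode in modes}


if __name__ == "__main__":
    print(boot_states())
```

The point of the design is that `start()` never raises because of the peer: a CI job can assert on the resulting state for each failure mode and fail the build if a new hard dependency sneaks in.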

No CI/test process was in place for hardware/firmware combos to ensure they recovered fine from power loss.

Takeaway: If you run a decent-sized cluster, ensure all new hardware ingested is tested through various power state transitions multiple times, and again after firmware updates. With software defined networking now the norm, we have little excuse not to put a machine through its paces on an automated basis before accepting it to run critical infrastructure.

No CI/test process was in place for status advisory processes to ensure they were sufficiently rapid, representative, and automated.

Takeaway: Test your status update processes as you would test any other component service. If humans are involved, drill them regularly.

Infrastructure was too dependent on a single data center.

Takeaway: Analyze worst-case failure modes, which are usually entire-site and power, networking or security related. Where possible, never depend on a single site. (At a more abstract level of business, this extends to legal jurisdictions.) Don't believe the promises of third-party service providers (SLAs).

PS. I am available for consulting, and not expensive.


maerF0x0 3787 days ago

Intern trips on power cable, 25% of servers go down.

Edit: this is mostly the "DR" part of tl;dr :P


draw_down 3787 days ago

"Stuff went wrong and our servers were down for a couple hours."

You're welcome.
