by contingencies 3787 days ago

No CI/test process was in place for critical systems to ensure that they had no external dependencies.

Takeaway: If you run any complex system, ensure that each component is tested for its response to various degrees of failure in peer services, including but not limited to total unavailability, intermittent connectivity, reduced bandwidth, lossy links, and power-cycling peers.
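
To make that concrete, here is a minimal sketch of a failure-matrix test in Python. Everything in it is hypothetical (`FlakyPeer`, `fetch_with_fallback`, the failure rates); it only illustrates the shape of such a test, with a fake peer standing in for real fault injection on the network.

```python
import random


class PeerUnavailable(Exception):
    """Raised by the fake peer to simulate a failed call."""


class FlakyPeer:
    """Hypothetical stand-in for a peer service with a configurable failure rate."""

    def __init__(self, failure_rate, seed=0):
        self.failure_rate = failure_rate
        self._rng = random.Random(seed)  # seeded so test runs are repeatable

    def fetch(self, key):
        # Fail with probability failure_rate (0.0 = healthy, 1.0 = totally down).
        if self._rng.random() < self.failure_rate:
            raise PeerUnavailable(key)
        return f"value-for-{key}"


def fetch_with_fallback(peer, key, retries=3, default=None):
    """Component under test: retry a few times, then fall back to a default."""
    for _ in range(retries):
        try:
            return peer.fetch(key)
        except PeerUnavailable:
            continue
    return default


def run_failure_matrix():
    """Exercise the component against healthy, intermittent, and dead peers."""
    results = {}
    for name, rate in [("healthy", 0.0), ("intermittent", 0.5), ("down", 1.0)]:
        peer = FlakyPeer(rate)
        results[name] = [
            fetch_with_fallback(peer, k, default="fallback") for k in range(20)
        ]
    return results
```

In CI you would assert on each row of the matrix, for example that the "down" case degrades to the fallback instead of raising. Reduced bandwidth and lossy links need injection at the network layer, which this sketch does not attempt.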

No CI/test process was in place for hardware/firmware combos to ensure they recovered fine from power loss.

Takeaway: If you run a decent-sized cluster, ensure that all newly ingested hardware is tested through various power state transitions multiple times, and again after each firmware update. With software-defined networking now the norm, we have little excuse not to put a machine through its paces automatically before accepting it to run critical infrastructure.
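
A rough sketch of what such an acceptance loop could look like, in Python. `SimulatedMachine` is a made-up fake so the example runs on its own; a real harness would drive a BMC or managed PDU and wait (with timeouts) for the host to come back, none of which is shown here.

```python
class SimulatedMachine:
    """Hypothetical fake; a real harness would drive a BMC or managed PDU instead."""

    def __init__(self, fails_on_cycle=None):
        self.fails_on_cycle = fails_on_cycle  # cycle number at which boot breaks
        self.cycles_seen = 0
        self.powered = True
        self.healthy = True

    def power_off(self):
        self.powered = False

    def power_on(self):
        self.cycles_seen += 1
        self.powered = True
        # Simulate a machine that fails to recover on one specific cycle.
        self.healthy = self.cycles_seen != self.fails_on_cycle

    def is_healthy(self):
        return self.powered and self.healthy


def power_cycle_check(machine, cycles=5):
    """Return (passed, cycles_run); stop at the first cycle that fails to recover."""
    for cycle in range(1, cycles + 1):
        machine.power_off()
        machine.power_on()
        if not machine.is_healthy():
            return False, cycle
    return True, cycles
```

The same check would be rerun after every firmware update, and a machine that fails any cycle never gets admitted to the pool.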

No CI/test process was in place for status advisory processes to ensure they were sufficiently rapid, representative, and automated.

Takeaway: Test your status update processes as you would test any other component service. If humans are involved, drill them regularly.
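
One way to treat the status process like any other service is to measure it during drills. A small Python sketch, assuming you record when a fault was injected and when the first public status update went out; the five-minute budget and the function names are my own placeholders, not a recommendation.

```python
from datetime import datetime, timedelta


def status_lag(fault_injected_at, status_updated_at):
    """Time between a (drilled) fault and the first public status update."""
    return status_updated_at - fault_injected_at


def drill_passed(fault_injected_at, status_updated_at, max_lag=timedelta(minutes=5)):
    """True if a status update was published, and within the allowed lag."""
    if status_updated_at is None:  # no update was ever published
        return False
    lag = status_lag(fault_injected_at, status_updated_at)
    return timedelta(0) <= lag <= max_lag
```

Usage is just a matter of feeding in the two timestamps from each drill and failing the drill when `drill_passed` returns False.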

Infrastructure was too dependent on a single data center.

Takeaway: Analyze worst-case failure modes, which usually take out an entire site and are power-, networking-, or security-related. Where possible, never depend on a single site. (At a more abstract business level, this extends to legal jurisdictions.) Don't believe the promises (SLAs) of third-party service providers.

PS. I am available for consulting, and not expensive.