by MatthiasPortzel 960 days ago
> In particular, two critical services that process logs and power our analytics — Kafka and ClickHouse — were only available in PDX-04 but had services that depended on them that were running in the high availability cluster. Those dependencies shouldn’t have been so tight, should have failed more gracefully, and we should have caught them.

This paragraph similarly leaves out juicy details. Exactly which services fail if logging is down? Were they built that way inadvertently? Why did no one notice?