The issue was in network equipment that they didn't even manage. You can't load balance when your core network is down. As I understand it, there was nothing they could do.
Their architecture is not fault-tolerant. If one server goes down and takes the whole system down with it, then the system was not designed to be fault-tolerant.
I have been running fault-tolerant systems spread across multiple dedicated servers (internally they use multiple distributed, replicated, and sharded DB/KV stores, Kafka, etc.). If one server suffers a hardware failure, the system recovers automatically within seconds to minutes (depending on which server or which part of the service failed), without any data loss.
It's not that hard. You need the knowledge, but it's not rocket science.
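To make the idea concrete, here is a rough sketch of why replication lets a system survive a single server failure without losing data. It is only an illustration, not how my setup (or any particular database) is implemented: `Node` and `ReplicatedStore` are hypothetical names, the "servers" are in-memory objects, and a write counts as successful once a majority of replicas acknowledge it.

```python
class Node:
    """Hypothetical in-memory stand-in for one storage server."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.up = True  # flip to False to simulate a hardware failure

    def put(self, key, value):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        self.data[key] = value

    def get(self, key):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        return self.data[key]


class ReplicatedStore:
    """Sends each write to every replica; succeeds on a majority of acks."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.quorum = len(nodes) // 2 + 1

    def put(self, key, value):
        acks = 0
        for node in self.nodes:
            try:
                node.put(key, value)
                acks += 1
            except ConnectionError:
                continue  # a dead replica must not fail the whole write
        if acks < self.quorum:
            raise RuntimeError("write failed: quorum not reached")

    def get(self, key):
        # Fail over to the next replica if one is down or missing the key.
        for node in self.nodes:
            try:
                return node.get(key)
            except (ConnectionError, KeyError):
                continue
        raise KeyError(key)


nodes = [Node("a"), Node("b"), Node("c")]
store = ReplicatedStore(nodes)
store.put("order:1", "paid")
nodes[0].up = False              # one server dies
print(store.get("order:1"))      # still readable from a surviving replica
store.put("order:2", "pending")  # still writable: 2 of 3 replicas ack
```

This leaves out everything that makes the real thing work in production (versioning to avoid stale reads, re-syncing a node that comes back, leader election), but the basic principle is the same: no single server holds the only copy.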