Hacker News new | ask | show | jobs
by lelag 600 days ago
It is hard.

Lichess admins are highly skilled and I'm sure they already have a well designed infrastructure. You can see what they use at https://docs.google.com/spreadsheets/d/1Si3PMUJGR9KrpE5lngSk...

The issue was on a network equipment that they didn't even manage. You can't load balance when your core network is down. There was nothing they could do as I understand it.

More details at: https://lichess.org/@/Lichess/blog/post-mortem-of-our-longes...

1 comments

Their architecture is not fault-tolerant. If one server goes down and the whole system goes down, then it was not designed to be fault-tolerant.

I have been running fault-tolerant systems spread across multiple dedicated servers (inside system with multiple DB/KV stores distributed/replicated/sharded, Kafka etc). If one server experiences hardware failure, the system will automatically recover within seconds to minutes (depending on which server/part of service failed) without any data loss.

It's not that hard. You need the knowledge, but it's not rocket science.