by raffraffraff 859 days ago
Oh NTP... I remember a series of extremely annoying incidents that were caused by time skew on hundreds of Linux VMs in our data center. Our setup was typical of a startup - built to be good enough at first, and fall apart at scale.

Every VM ran CentOS, and every one of them hit the default CentOS NTP servers. These are run by volunteers. The pool is generally good quality, but using it the way we did was extremely stupid.

Every few weeks we'd have one of these "events" where hundreds of VMs in a data center would skew, causing havoc with authentication, replication and clustering. We also had an alert that would notify the machine owner if drift exceeded some value. If that happened in the middle of the night, the oncall from every single team would get woken. And if they simply "acked" the alert and went back to sleep, the drift would continue, and by morning their service would almost certainly be suffering.

Whatever about diagnosing the cause, I started by writing a script that executed a time fix against a chosen internal server, just to resolve the immediate issue. I also converted the spam alert into one that Sensu (the monitoring/alerting system we used) would aggregate into a single alert to the fleet ops team. In other words, if >2% of machines were skewed by more than a few seconds, warn us. At >4%, go critical. (Only critical alerts would page oncall outside sociable hours.)
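
For anyone wanting the gist of that aggregate rule, here is a rough Python sketch. It is not the Sensu check we actually ran; the function name, the 3-second skew limit (standing in for "a few seconds") and the input format are all assumptions for illustration.

```python
def fleet_skew_status(offsets_s, skew_limit_s=3.0, warn_frac=0.02, crit_frac=0.04):
    """Collapse per-machine clock offsets (in seconds) into one fleet-wide status.

    skew_limit_s is an assumed stand-in for "a few seconds";
    warn_frac and crit_frac are the 2% and 4% thresholds.
    """
    offsets = list(offsets_s)
    if not offsets:
        return "ok"  # nothing reported, nothing to alert on

    # Count machines whose clock is off by more than the limit, in either direction.
    skewed = sum(1 for offset in offsets if abs(offset) > skew_limit_s)
    fraction = skewed / len(offsets)

    if fraction > crit_frac:
        return "critical"
    if fraction > warn_frac:
        return "warning"
    return "ok"
```

The point is that one bad machine (or a handful) stays quiet, while a fleet-wide event produces exactly one alert instead of one per team.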

Long story short, we switched to chrony, because unlike ntpd we could convince it to "just fix the damn time". ntpd would refuse to correct the time if the jump was too big, and would just drift off forever until manually fixed. (No amount of config hacking and reading 'man ntpd' got around this.) We also chose two bare-metal servers in each data center to work as internal NTP servers, reducing the possibility of DoSing these volunteer NTP servers and getting our IP range blacklisted or fed dud data. Problem solved right there, and we also ended up with better monitoring of time skew across our fleet.
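
A minimal chrony.conf sketch of that kind of client setup, not our actual config: the two hostnames are hypothetical placeholders for the internal servers, and the 1-second step threshold is an assumed value.

```
# Hypothetical internal NTP servers for this data center
server ntp1.dc1.example.internal iburst
server ntp2.dc1.example.internal iburst

# Step the clock whenever the offset exceeds 1 second.
# The -1 limit means "on any update", not just the first few after startup.
makestep 1.0 -1
```

The `makestep` line is the "just fix the damn time" part: with a negative limit, chronyd will step the clock on any update where the offset is over the threshold, rather than only slewing.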