by RossBencina 191 days ago
-10ms, no redundant clocks, and they're leaving most of the servers up with that amount of skew. Wow. I am astonished that NIST does not have multiple clocks over multiple distributed sites with robust ability to detect and bypass individual failures.

They do have multiple clocks and sites. The primary clock is in Boulder. Only the Maryland time servers are affected; the Colorado ones should be fine. They mention switching to another atomic clock, but that probably has to be set up first.

The email explains why they haven't shut the servers down: the skew hasn't hit the threshold. It also talks about maybe shutting them down manually.

> I am astonished that NIST does not have multiple clocks over multiple distributed sites with robust ability to detect and bypass individual failures.

They may not operate redundant clocks at a single site, but ITS redundancy posture[1] doesn't look bad at all:

>> Servers at the Boulder and WWV/Ft. Collins campuses are independent and unaffected.

[1] https://tf.nist.gov/tf-cgi/servers.cgi

> I am astonished that NIST does not have multiple clocks over multiple distributed sites with robust ability to detect and bypass individual failures.

Is this sarcasm? I can't tell.

Per the email:

> Servers at the Boulder and WWV/Ft. Collins campuses are independent and unaffected.

Sorry, maybe I got carried away with the tone. But it is not sarcasm. I genuinely did not realise that the NTP service level was so low. There are two problems raised in the email: (1) There is no on-site redundant fail-over upstream of the NTP servers. (2) The NTP servers at the site were not all taken down automatically as soon as the fault was detected (because some were still, in some sense, within tolerance). This places all of the fault management onto downstream NTP servers. I honestly expected NIST to be running a robust cross-site timebase upstream of NTP.
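
To make "fault management onto downstream NTP servers" concrete, here is a rough sketch of what a downstream consumer ends up doing: compare the offsets it measures against several upstream servers and discard the one that disagrees with the rest. This is a plain median filter, not the selection algorithm real NTP daemons use, and the server labels, readings, and 5 ms tolerance are all made up for illustration (only the -10 ms skew comes from the thread).

```python
from statistics import median


def select_sources(offsets_ms, tolerance_ms=5.0):
    """Split measured offsets into agreeing sources and suspected outliers.

    offsets_ms:   dict of server label -> measured offset in milliseconds.
    tolerance_ms: hypothetical cut-off around the median (an assumption,
                  not a value taken from NIST or from any NTP daemon).
    """
    if not offsets_ms:
        raise ValueError("need at least one source")
    mid = median(offsets_ms.values())
    # Keep every source whose offset lies within the tolerance of the median.
    good = {name: off for name, off in offsets_ms.items()
            if abs(off - mid) <= tolerance_ms}
    # Everything else is treated as a suspected bad clock.
    bad = {name: off for name, off in offsets_ms.items() if name not in good}
    return good, bad


# Illustrative readings only: three hypothetical healthy servers and one
# hypothetical server showing the -10 ms skew discussed above.
readings = {
    "boulder-a": 0.2,
    "boulder-b": -0.1,
    "ftcollins-a": 0.3,
    "maryland-a": -10.0,
}
good, bad = select_sources(readings)
print("trusted:", sorted(good))
print("rejected:", sorted(bad))
```

This only works if the client polls enough independent sources for the median to be trustworthy, which is exactly the burden that shifts downstream when a skewed server stays up.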