by zkms 3456 days ago
What causes real-world problems with leap seconds is actually unrelated to the nasty interactions of metrology and solar time -- it's a specific and avoidable problem with how NTP (and many OSes/languages) represent time -- it's a types issue.

The right way for computers to represent time is with a number that represents the number of constant-rate ticks that have elapsed past some agreed-upon epoch. If you know what the epoch is and how long each tick is (lots of people use 1 / 9.192 GHz), it is easy to know how many ticks are between any two time values, and you can convert a time value with one epoch to one with a different epoch and tick rate -- you can do everything people expect to do with time. There are no numbers that represent an invalid time value, and for each moment, there is a unique time value that represents it. There's a one-to-one mapping with no nasty edge cases.
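
To make the epoch/tick-rate point concrete, here is a minimal sketch. The `Timescale` class, the `convert` and `elapsed_seconds` helpers, and the two example timescales are all hypothetical names made up for illustration; exact fractions are used so no rounding sneaks in.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class Timescale:
    """Hypothetical constant-rate timescale: an epoch plus a tick length."""
    epoch_s: Fraction  # where tick 0 sits, in seconds on a shared reference axis
    tick_s: Fraction   # length of one tick, in seconds


def convert(ticks: int, src: Timescale, dst: Timescale) -> Fraction:
    """Re-express a tick count on `src` as a (possibly fractional) tick count on `dst`."""
    absolute_s = src.epoch_s + ticks * src.tick_s
    return (absolute_s - dst.epoch_s) / dst.tick_s


def elapsed_seconds(a: int, b: int, scale: Timescale) -> Fraction:
    """Seconds between two tick counts on the same timescale: plain subtraction."""
    return (b - a) * scale.tick_s


# Two made-up timescales: whole seconds from reference 0,
# and milliseconds from an epoch 1000 s later.
SECONDS = Timescale(epoch_s=Fraction(0), tick_s=Fraction(1))
MILLIS = Timescale(epoch_s=Fraction(1000), tick_s=Fraction(1, 1000))

in_millis = convert(1500, SECONDS, MILLIS)  # 500 s past the MILLIS epoch
gap = elapsed_seconds(10, 25, MILLIS)       # 15 ticks of 1 ms each
```

Every integer is a valid time value here, and converting is just a shift and a scale.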

Leap seconds are a step function that is added to a constant-rate timescale (whose name is "TAI") in order to generate a discontinuous timescale (whose name is "UTC") that is never too different from solar time. There is nothing fundamentally abhorrent about leap seconds -- there are just good and bad ways to represent, disseminate, and compute with timescales that involve leap seconds.

The right way to handle leap seconds can be seen with many GNSSes and PTP (very high precision hardware-assisted time synchronization over Ethernet). GPS, BeiDou, Galileo, and PTP all involve dissemination and computation on time values -- and with dire consequences for failure/downtime/inaccuracy.

The designers of those systems all somehow converged on the choice to separate out the nice, predictable, constant-rate and discontinuity-free part of UTC from the nasty step function (the leap second offset). Times in all those systems are represented as the tuple (TAI time at t, leap offset at t). This means that the entire system can calculate and work with (discontinuity-free and constant-rate) TAI times but also truck around the leap offsets so when time values need to be presented to a user (or anything that requires a UTC time), the leap offset can be added then. Crucially, all the maths that are done on time values are done on TAI values, so calculating a time difference or a frequency is easy and the result is always correct, regardless of the leap second state of affairs. Representing UTC time as a tuple makes the semantics of that data type easy to reason about -- the "time" bit is in the first element and is completely harmless -- the edge cases all live in the second half of the tuple.
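
A rough sketch of that tuple as a data type, not how any GNSS or PTP implementation actually lays it out. The `UtcTime` name, its fields, and the sign convention (offset stored as UTC minus TAI, so it can simply be added) are assumptions for illustration; the epoch and the TAI values are arbitrary, while the 36 s to 37 s step is the real TAI-UTC change at the end of 2016.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UtcTime:
    """Hypothetical (TAI time, leap offset) pair."""
    tai_s: int          # seconds on a continuous, constant-rate scale
    leap_offset_s: int  # assumed convention: UTC minus TAI in effect at that moment

    def elapsed_since(self, earlier: "UtcTime") -> int:
        """True elapsed seconds: only the TAI halves take part in the arithmetic."""
        return self.tai_s - earlier.tai_s

    def utc_label_s(self) -> int:
        """UTC-style seconds count, computed only when a UTC value is required."""
        return self.tai_s + self.leap_offset_s


# 23:59:59 UTC just before a leap second, and 00:00:00 UTC just after it.
# Two SI seconds separate them, because 23:59:60 sits in between.
before = UtcTime(tai_s=1000, leap_offset_s=-36)
after = UtcTime(tai_s=1002, leap_offset_s=-37)

true_gap = after.elapsed_since(before)                  # 2
label_gap = after.utc_label_s() - before.utc_label_s()  # 1
```

Subtracting the UTC labels loses a second; subtracting the TAI halves does not.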

NTP and Unix (and everything descended from or affected by them) have made the mistake of representing and transmitting time as a single integer, TAI(t) + leap offset(t). This is not a data representation that has sensible semantics, and it is very hard to reason about. First of all, the leap second offset is nondeterministic and also unknown -- there is no way to get it from NTP and there is no good way to know the time of the next leap event. Second of all, there are repeated time values for different moments in time (and when a negative leap second happens, there will be time values that represent no moments in time). Predictably, introducing nondeterministic discontinuities doesn't work so well in the real world. There are a bunch of bugs in NTP software and OS kernels and applications that show themselves every time there is a leap second. It's not even just NTP clients that struggle -- 40% of public Stratum-1 NTP servers had erroneous behavior [0] related to the 2015 leap second! Given that level of repeated and widespread failure, the right solution is not to blame programmers -- it should be to blame the standard. The UTC standard and how NTP disseminates UTC are fundamentally not fit for computer timekeeping.
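
The repeated-value problem is easy to see with nothing but the Python standard library. `calendar.timegm` applies the usual Unix arithmetic (days times 86400, plus the time of day) and does not validate the seconds field, so it will happily accept the real leap second at the end of 2016:

```python
import calendar

# Three consecutive SI seconds around the leap second at the end of 2016.
t_before = calendar.timegm((2016, 12, 31, 23, 59, 59))
t_leap = calendar.timegm((2016, 12, 31, 23, 59, 60))  # the inserted leap second
t_after = calendar.timegm((2017, 1, 1, 0, 0, 0))

# The leap second and the following midnight get the same integer.
collision = (t_leap == t_after)

# Two seconds of real time elapsed, but the integers differ by one.
apparent_gap = t_after - t_before
```

Two distinct moments share one timestamp, and nothing in the integer says which one was meant.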

GNSS receivers and PTP hardware get used in mission-critical applications (synchronizing power grids and multi-axis industrial processes, timestamping data from test flights and particle accelerators) all the time -- and even worse, there's no way to conveniently schedule downtime/maintenance windows during leap second events! "Leap smear" isn't an acceptable solution for those applications, either -- you can't lie about how long a second is to the Large Hadron Collider. GNSS and PTP systems handle leap second timescales without a hitch by representing UTC time with the right data type -- a tuple that properly separates two values that have the same unit (seconds) but have vastly different semantics. The NTP and Unix timestamp approach of directly baking the discontinuities into the time values reliably causes problems and outages. The leap second debacle is not about solar time vs atomic time; it's about the need for data types that accurately represent the semantics of what they describe.

[0]: http://crin.eng.uts.edu.au/~darryl/Publications/LeapSecond_c...


Except people want to be able to talk about times years in the future despite not knowing the number of leap seconds that may happen in the intervening time. It is more useful in most fields to talk about an event happening every N years/months/days than an event happening every N seconds. Most people do not want a leap second to shift their scheduled event from 10:00:00 every Monday to 9:59:59 or 10:00:01 in the name of using a whole number of 86400-second intervals.
If you want to say "10:00 every Monday" then say "10:00 every Monday" and accept that what you have is not an unambiguous point in time, nor an integer, but a calendar event that may occur at some time in the future depending on geopolitical changes to the local time zone and the rotation of the earth.

Mutilating all timestamps and network time representations by adding a variable unknown step function (the leap second "correction") in order to preserve the illusion that days are always 86400 "seconds" long doesn't help solve this problem at all.

> Except people want to be able to talk about times years in the future despite not knowing the number of leap seconds that may happen in the intervening time.

Doesn't the (TAI, leap second count) tuple solution work for this? Maybe I misunderstand the purpose, but you could use the leap second count to figure out how many seconds the TAI is off by.

But that doesn't matter, because date intervals shouldn't be represented with seconds anyway. Months and years have different lengths.

…I forgot to mention this in my original comment, but real-world wall-clock time is in any case discontinuous due to daylight savings time and other timezone changes. This means that it's not only months and years that change in length, but days and weeks too.
You cannot represent "Next Monday at 12:00" with a tuple (TAI, leap second count), because you don't know how many leap seconds there should be. Or maybe you know for next Monday, but you definitely don't know for the Monday in a year, as leap seconds are only announced ~6 months in advance.
I think you cannot represent "Monday in a year at 12:00" with a simple integer either, right? For example, the king of the country may decide to cancel DST for the year. Either way you would have to store it as a calendar event and figure out the exact time once you're closer.
You should store that similar to this:

    begin = (today, 12:00) (eg. 2017-01-01T12:00:00)
    repeat = RRULE:FREQ=WEEKLY;COUNT=1;BYDAY=MO
Note that "begin" is usually something software figures out itself.
I think this is solved by storing an event date as UTC (since we can't always know how many leap seconds will be required), but when triggering an event, we calculate the UTC from TAI + Leap Seconds.

An event in the future isn't necessarily a known number of seconds away, which I think is the point you were trying to make. But the parent comment wasn't suggesting all instances of time should be stored as (tai, leap seconds). Calculating a UTC value from (tai, leap seconds) is trivial, but if the thing you care about is the UTC value then that's what you store.

Sometimes it's best to store a scheduled event as "Event localtime" and "Timezone" (where timezone is a named description - e.g. "Europe/Madrid" - rather than an offset - e.g. "+1:00").

This allows the record to stay consistent, even if there are changes to the local time rules - e.g. leap seconds, daylight savings, timezone offset.

Imagine a tech-camp had been planned in Cairo, Egypt, to start at 9am on July 10, 2016: that would have been scheduled for 06:00 UTC. When Egypt cancelled daylight savings with three days' notice, that record should then have been 07:00 UTC.
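
As a minimal sketch of the Cairo case: the `ScheduledEvent` record, the `resolve_utc` helper, and the two rule tables below are hypothetical and hard-coded for illustration. In practice the rules would come from the tz database (for example via Python's `zoneinfo` module) rather than from a dict.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ScheduledEvent:
    """Hypothetical stored record: wall-clock time plus a zone *name*, no offset."""
    local_time: datetime  # naive wall-clock time
    zone_name: str        # e.g. "Africa/Cairo"


def resolve_utc(event: ScheduledEvent, rules: dict) -> datetime:
    """Work out the UTC instant using whatever rules are current at lookup time."""
    offset = rules[event.zone_name](event.local_time)
    aware = event.local_time.replace(tzinfo=timezone(offset))
    return aware.astimezone(timezone.utc)


# Stand-ins for two versions of the zone rules (assumed, not read from tzdata):
# with DST expected in July 2016 (UTC+3), and after DST was cancelled (UTC+2).
rules_with_dst = {"Africa/Cairo": lambda local: timedelta(hours=3)}
rules_dst_cancelled = {"Africa/Cairo": lambda local: timedelta(hours=2)}

camp = ScheduledEvent(datetime(2016, 7, 10, 9, 0), "Africa/Cairo")

planned = resolve_utc(camp, rules_with_dst)      # 06:00 UTC
actual = resolve_utc(camp, rules_dst_cancelled)  # 07:00 UTC
```

The stored record never changes; only the resolved UTC instant moves when the rules do. A record stored as 06:00 UTC would have silently become wrong.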

Yup, I'm aware of this and should have mentioned it in my comment. Thanks for the follow up.

As an aside, how often do the tz databases for each language get released? Are they usually responsive to notices 3 days out?

Edit: I went looking into the pytz release for the Cairo example from parent.

Olson Timezone Database:

Release 2016f - 2016-07-05 16:26:51 +0200

https://github.com/stub42/pytz/commit/03a4e9b31dd90f3dace1eb...

Pytz:

Release 2016.6 - 2016-07-13

https://pypi.python.org/pypi/pytz/2016.6

So even if the tz database is up to date, there's no guarantee that various library usages of the tz database will be correct for these kinds of changes. Interesting.

I just came across a note about Morocco, which entered daylight savings time in March 2016, but then left daylight-savings in June for 35 days, re-starting daylight-savings in July [1].

I've read that the explanation for this temporary suspension of daylight-savings is Ramadan [2], and Ramadan is dependent on the observed sighting of the new moon - so you can't necessarily predict the date in advance.

I ended up coming across that after looking for an explanation for something bizarre I experienced on a trip to Morocco in March 2016…with my iPhone set to use "Marrakesh, Morocco", the time on the phone displayed correctly, but the time on my sync'd Apple watch was an hour out. I think I ended up manually setting it to Paris time to get the correct time, but never did get an explanation for the difference.

So even across two devices from the same manufacturer, theoretically sharing the same date-time information, they can be inconsistent.

Conclusion: time is hard!

[1] https://www.timeanddate.com/time/change/morocco/tanger?year=...

[2] http://codeofmatt.com/2016/04/23/on-the-timing-of-time-zone-...

Anything less than 2 weeks is a gamble; I follow the time zone list closely and go out of my way to poke some maintainers of libraries we depend on when something like the Egypt change happens.
Another neat example in the "UTC ain't always the right thing to do" category.
Thank you so very much for the effort to elucidate the real problem hiding underneath the usual slew of Leap Second issues.

I keep telling people to use TAI. I once contemplated writing kernel code to rebase internal clock stuff to TAI, but at the end of the day it was not worth doing, because I would have needed to build a completely new stack of things above the kernel to use it in order to avoid problems.