by clord 1172 days ago
“Parquet has both a date type and the datetime type (both sensibly recorded as integers in UTC).”

What does it mean for a date to be UTC? My date, but in the UTC timezone? Usually when I write a date to a file, I want a naive date, since that’s the domain of the data. 2020-12-12 sales: $500. But splitting that by some other timezone seems to be introducing a mistake.

Often I want to think in local naive time too, especially for things like appointments that might change depending on DST or whatever. Converting to UTC involves some scary things. Timestamps are also useful, but sometimes I don’t want to transcode my data, as the natural format is the most correct.

9 comments

Lot of confusion here. UTC is a time standard, not a particular time zone. An instant written down in, say, the Pacific Standard Time Zone can be a UTC-scaled time.

An example of non-UTC time is TAI, which is International Atomic Time. The difference is that UTC has leap seconds to deal with changes in the rate of rotation of the earth, while TAI marches on without any discontinuities.

So for a date to be “in UTC” really just means it uses the leap seconds published by IERS. This article says “integers in UTC” which is a little ambiguous, but probably means “integer UTC seconds since the Unix Epoch.”
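
To make the UTC/TAI relationship concrete, here is a minimal Python sketch. It assumes a hand-copied, partial table of TAI minus UTC offsets (only the three most recent steps), and `utc_to_tai` is a hypothetical helper name, not a library function.

```python
from datetime import datetime, timedelta

# Partial table of TAI-UTC (in seconds), keyed by the UTC date each value took effect.
# Only the three most recent steps are listed; earlier dates are out of scope here.
TAI_MINUS_UTC = [
    (datetime(2012, 7, 1), 35),
    (datetime(2015, 7, 1), 36),
    (datetime(2017, 1, 1), 37),
]

def utc_to_tai(utc: datetime) -> datetime:
    """Convert a naive datetime holding UTC fields to the matching TAI reading."""
    offset = None
    for effective, seconds in TAI_MINUS_UTC:
        if utc >= effective:
            offset = seconds  # keep the latest step that is already in effect
    if offset is None:
        raise ValueError("date is before the first entry in this partial table")
    return utc + timedelta(seconds=offset)
```

A real implementation would load the full published leap second list rather than hard-coding entries.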

> Lot of confusion here. UTC is a time standard, not a particular time zone. An instant written down in, say, the Pacific Standard Time Zone can be a UTC-scaled time.

UTC is very much a timezone, that’s why it has a timezone designator (Z).

If you record a future PST date as UTC and PST changes, your recorded date is now wrong.

> So for a date to be “in UTC” really just means it uses the leap seconds published by IERS. This article says “integers in UTC” which is a little ambiguous, but probably means “integer UTC seconds since the Unix Epoch.”

And that’s guaranteed to fuck up for future local events.

> If you record a future PST date as UTC and PST changes, your recorded date is now wrong.

Damn. That's a very good point. From now on, I'll always record the timezone as well, especially when it comes to future dates!

Any timezone you want, as long as it's UTC. (If the above scares you, think about how to record an event that should happen on the 5th of November 2023 at 1:30 AM PST.)
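
A small Python sketch of why that wall-clock time is awkward: in Los Angeles, 1:30 AM on 2023-11-05 occurs twice, once before and once after the clocks go back. This assumes Python 3.9+ with a tz database available to `zoneinfo`; the variable names are illustrative only.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

LA = ZoneInfo("America/Los_Angeles")

# The same wall-clock reading, disambiguated only by the fold attribute.
first = datetime(2023, 11, 5, 1, 30, tzinfo=LA, fold=0)   # earlier occurrence (PDT, UTC-7)
second = datetime(2023, 11, 5, 1, 30, tzinfo=LA, fold=1)  # later occurrence (PST, UTC-8)

# Converting both to UTC shows they are two different instants, one hour apart.
first_utc = first.astimezone(timezone.utc)
second_utc = second.astimezone(timezone.utc)
gap = second_utc - first_utc
```

Without the "PST" label (or an equivalent offset), the stored wall time alone cannot say which of the two instants was meant.
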
> UTC is very much a timezone, that’s why it has a timezone designator (Z).

This is just not correct. The "Z" is a historical maritime designation for the zero-point timezone, GMT, and predates UTC by decades.

UTC is defined by the International Telecommunication Union in recommendation TF.460-6 (pdf link: https://www.itu.int/dms_pubrec/itu-r/rec/tf/R-REC-TF.460-6-2...). It's brief and you won't find any mention of time zones.

The symbolism "UTC+0" or the "Z" timezone come from ISO-8601, which is a specification for how to represent times in strings. You can find that specification here: https://web.archive.org/web/20171019211402/https://www.loc.g...

See section 2.1.12 Note 3:

> UTC is not a time zone, it is a standard. UTC is also not GMT (Greenwich Mean Time), rather, UTC has replaced GMT. UTC is more precise; the term 'GMT' is ambiguous.

That document goes to great lengths to keep this distinction between time scales and time formats. They're different, and conflating them will get you very confused. When you describe a time in PST, you are almost certainly using UTC.

I think you may have missed the idea: usually dates (not datetimes) are the legal fiction "date" NOT "an instant." I.e. timezones are irrelevant.

Birth dates, contract dates, sale dates, billing dates, insurance coverage dates, etc.

---

EDIT: UTC still plays a role -- as there is still the choice of calendar (though you'd be forgiven for assuming Gregorian) -- but it's an odd statement to decipher.

Ah! You're right, I totally missed that the commenter was griping about UTC dates as opposed to datetimes. I agree, a "UTC date" is not a clean concept. We can talk about Julian or Gregorian dates, but those are independent of time scales like UTC or TAI or UT1.
How are time zones irrelevant? The current date, at this very moment, depends on the time zone.

I think of date more like a date time simplified to day precision.

Still, I agree UTC date is unclear.

> How are time zones irrelevant?

Nobody is going to adjust your birthdate when you move abroad.

If you move east, you’ll be able to drink a little earlier than if you move west.

Nobody is adjusting Christmas Eve against the time zone of Bethlehem.

There are many situations where a date is just a number in a calendar and not a specific time on planet earth.

> Nobody is going to adjust your birthdate when you move abroad.

But that’s just a convention, right? The day you are born is still dependent on the time zone. If you are born in the US at 11pm EST then someone born at that same moment in the UK has a different birthday.

Dates have boundaries. These boundaries are dependent on time zone. We can talk about dates irrespective of time zone but day periods cannot be understood without reference to time zone.
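
The "same moment, different birthday" case can be sketched in Python. Fixed offsets stand in for EST and GMT here so the example does not depend on a tz database, and the specific date is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets, used only to keep the sketch self-contained.
EST = timezone(timedelta(hours=-5), "EST")
GMT = timezone(timedelta(0), "GMT")

# One instant: 11pm EST on a hypothetical day.
birth_instant = datetime(2023, 1, 15, 23, 0, tzinfo=EST)

# Deriving a calendar date from an instant requires choosing a zone first.
us_date = birth_instant.date()                  # 2023-01-15
uk_date = birth_instant.astimezone(GMT).date()  # 2023-01-16
```

The zone matters at the moment of conversion from instant to date; once the date is written down, nothing about it needs a zone any more.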

You are mixing two things up into one conversation. You are adding time into the conversation, in which case yes you need timezones, but if you don't add time into the conversation and just have dates, then you don't have timezones.
> But that’s just a convention, right?

Yes, "just."

Dates are "just" a convention.

As you say, timezones are relevant if you need to convert an instant to a date, or vice versa.

But you can store, operate, and query on dates without the foggiest clue about instants or timezones.

But if you have a date, and a timezone...how do you do anything useful with those 2 things? If you add 8 hours to a date due to PST, what do you get?
UTC isn’t a time zone, it’s a specification for how many seconds are in a day. In UTC, there are 86400 seconds on most days, but the IERS may announce a “leap second” which makes some particular day either 86401 seconds or 86399 seconds (historically always the former).
That was already clear
A bit of an aside: by 2035, UTC will be a static offset from TAI, with the addition of leap seconds being phased out
That will be a wonderful change. Unfortunately we will still have to deal with leap seconds retrospectively, but it is certainly a step forward, and it will be nice to have a static list of historical leap seconds rather than all computers everywhere polling IERS every month or two.
Agreed. I think this is where Java’s latest date/time libraries shine (heavily based on Joda time lib).

It makes a solid distinction between Dates, times, instants, and zone dates/times.

An Instant is a moment in time. Exposed as a UTC time stamp.

Birthdays etc are best expressed as LocalDates. There’s also LocalDateTime for zone-agnostic features.

If I were writing software like you describe I’d want an explicitly zone-agnostic date(time) to represent it.

>If I were writing software like you describe I’d want an explicitly zone-agnostic date(time) to represent it.

I would love to see this as a data type (CalendarDate) baked into languages & databases which would prevent you from applying timezone conversions to it.
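
As a rough illustration of the idea, Python's standard `datetime.date` already behaves somewhat like this: it carries no zone and offers no zone conversion. A minimal sketch (the birthday value is made up):

```python
from datetime import date, datetime, timedelta

birthday = date(1990, 5, 17)

# A date supports calendar arithmetic...
next_week = birthday + timedelta(days=7)

# ...but it has no astimezone(), so a zone conversion cannot be applied by accident.
can_convert_date = hasattr(birthday, "astimezone")                    # False
can_convert_datetime = hasattr(datetime(1990, 5, 17), "astimezone")   # True
```

One caveat: in Python, `datetime` is a subclass of `date`, so the type alone does not stop a full timestamp from being passed where a plain date was expected.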

that would be the goal, but it's hard. I went down a long blog-post road about a "clockchain" which includes a way to reconcile zone-less time. tl;dr all time is relative between a minimum of two objects. all but certain there is an elegant way to do it beyond my current thinking.

https://www.seanmcdonald.xyz/p/the-clockchain-protocol-the-l...

cal.com is solving some of these problems with handling of all zones.

> The DATE logical type annotates an INT32 that stores the number of days from the Unix epoch, January 1, 1970.

So it's a naive date.
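
The quoted encoding (days since 1970-01-01) can be sketched in a few lines of Python. The function names are hypothetical, and this only mirrors the arithmetic, not any Parquet library API.

```python
from datetime import date, timedelta

UNIX_EPOCH = date(1970, 1, 1)

def date_to_days(d: date) -> int:
    """Encode a calendar date as a day count from the Unix epoch."""
    return (d - UNIX_EPOCH).days

def days_to_date(days: int) -> date:
    """Decode a day count back into a calendar date."""
    return UNIX_EPOCH + timedelta(days=days)
```

No zone appears anywhere in the round trip, which is what makes the stored value a naive date.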

Not sure how to support naive time, though, like 23:59:59, (or even leap-second one, 23:59:60). Probably have to store it as integers, and deal with conversion on application side.

Parquet doesn't really do timestamp with time zone. Also relevantly for processing Parquet, Apache Spark doesn't really do timestamp with time zone. Meaning that if your data go through parquet (e.g. in a data lake) you have to store timestamps as strings, or lose the time zone.

It sucks.

We went for a third option: store the timezone in a separate column for data where the timezone is needed.
Date is confusing with a timezone (UTC or otherwise) and the doco makes no such suggestion.

The Parquet datatypes documentation is pretty clear that there is a flag isAdjustedToUTC to define if the timestamp should be interpreted as having Instant semantics or Local semantics.

https://github.com/apache/parquet-format/blob/master/Logical...

Still no option to include a TZ offset in the data (so the same datum can be interpreted with both Local and Instant semantics) but not bad really.
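
A rough Python sketch of the two interpretations, assuming a microsecond-precision integer. `decode_timestamp_micros` is a hypothetical helper written for illustration, not part of any Parquet reader.

```python
from datetime import datetime, timedelta, timezone

EPOCH_NAIVE = datetime(1970, 1, 1)
EPOCH_UTC = datetime(1970, 1, 1, tzinfo=timezone.utc)

def decode_timestamp_micros(value: int, is_adjusted_to_utc: bool) -> datetime:
    """Interpret a stored integer with Instant or Local semantics."""
    delta = timedelta(microseconds=value)
    if is_adjusted_to_utc:
        # Instant semantics: a point on the UTC timeline, returned as an aware datetime.
        return EPOCH_UTC + delta
    # Local semantics: wall-clock fields with no zone attached, returned as a naive datetime.
    return EPOCH_NAIVE + delta
```

The same integer yields the same calendar fields either way; only the meaning attached to those fields changes.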

For anyone interested in the design approach to Parquet, here's a good interview with the Parquet designers. It covers a lot of the issues they were trying to address with the format. https://www.youtube.com/watch?v=qe0SeC0Hr_k
What is a "naive" or "natural" date? A date in your current timezone? That's often a source of bugs if the tz is not explicit.
It can be a source of bugs if the tz is explicit, as well. Consider storing people's birthdates. If you accidentally store it as UTC and use some implicit tz handling, suddenly you've got off-by-one birthdates.

More succinctly, dates don't have timezones because they don't have times.
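
The off-by-one is easy to reproduce. A minimal Python sketch, using a fixed UTC-8 offset as a stand-in for a Pacific user and a made-up birthdate:

```python
from datetime import date, datetime, time, timedelta, timezone

PACIFIC_STANDARD = timezone(timedelta(hours=-8))  # fixed offset, for illustration only

birthdate = date(1990, 5, 17)

# The bug: promote the date to "midnight UTC", then let zone handling convert it.
as_utc_midnight = datetime.combine(birthdate, time(0, 0), tzinfo=timezone.utc)
displayed = as_utc_midnight.astimezone(PACIFIC_STANDARD).date()  # 1990-05-16
```

Keeping the value as a plain `date` end to end avoids the problem, because there is no time component for a conversion to shift.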

An example I like to use is medication reminders. For some, you need them to be at absolute time intervals: UTC is likely appropriate, and omitting time offsets could be disastrous. Others are more appropriate at periods of the experienced waking day, and using UTC could be extraordinarily disruptive eg while traveling.
Another case that pops up a lot is dealing with dates from third party systems that aren't timezone aware. Trying to convert them without knowing what timezone they were supposed to be in in the first place doesn't work, you have to just display them with no conversion.
Oh yeah I hesitate to even go into this because it’s been a huge source of traumatic burnout dealing with the fallout of it being mishandled.
The best real-world example I've heard of comes from a friend who works for a travel company.

You're creating a schedule for a tour. The tour occurs some time in the future. You have a daily itinerary for the tour group.

Now the destination country changes their time zone in some fashion. You don't want to have ever stored an instant, you want the date and time to be more abstract. 9am, as it will be understood at that date and time in the future.

The same also occurs for past events, but we don't tend to encounter that too often, although the calendar for September 1752 in the UK (at the time known as the Kingdom of Great Britain) shows another wrinkle

I swear that some variant of this bug happened to me in gCal. I set myself some yearly recurring events, I think a birthday or a name day, and clearly marked it as an “All Day” event and I set it to repeat Yearly.

I don’t remember if it was in 1 or 2 or even more years, but Google calendar absolutely did make the event off-by-one day.

Naive date breaks down as soon as your domain spans the globe and interacts with local entities. It becomes important to normalize on some “0” index (UTC) and convert to local times (which are often naive) as needed.
Storing (UTC, latitude, longitude, altitude) is the holy grail, I guess. Everything else is too leaky.

I mean, that's literally the four dimensions of space-time.

UTC is non-unique because of leap seconds, so TAI + lat + long + altitude is actually required.

I work on software for astronomy, and that quadruplet is what’s used to describe an observing location. You can actually get a little in trouble because of changes in the shape of the earth over time, so latitude and longitude and altitude need to be treated as time-dependent values, which matters once you are accounting for relativistic effects.

UTC times are unique: when a positive leap second is inserted, 23:59:60 is added; when a negative leap second is inserted, 23:59:59 is skipped.

Unix timestamps, on the other hand, are not unique.
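
To see why, here is a short Python sketch of the POSIX arithmetic, which counts every day as exactly 86400 seconds. `posix_timestamp` is a hypothetical helper that spells the formula out by hand.

```python
from datetime import date

def posix_timestamp(year, month, day, hour, minute, second):
    """Seconds since the Unix epoch, treating every day as 86400 seconds."""
    days = (date(year, month, day) - date(1970, 1, 1)).days
    return days * 86400 + hour * 3600 + minute * 60 + second

# The leap second at the end of 2016 and the following midnight collide.
leap_second = posix_timestamp(2016, 12, 31, 23, 59, 60)
next_midnight = posix_timestamp(2017, 1, 1, 0, 0, 0)
```

Both calls give 1483228800, so two distinct UTC labels share one Unix timestamp.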

That sort of assumes there's one time zone that's being used per spacetime coordinate, which isn't guaranteed. You can get political situations where de facto and de jure time diverge, or where different authorities nominally in charge of the time in a place disagree.

Lebanon seems to have experienced exactly this recently:

https://www.bbc.com/news/world-middle-east-65079574

It does unambiguously give you a spacetime coordinate (useful) but it doesn't unambiguously tell you what local time you should use for an occurrence, and the answer would really depend on who was asking.

> Storing (UTC, latitude, longitude, altitude) is the holy grail, I guess.

It’s not.

If I set up a meeting next year in NYC at 10 and the legislature decides to change the timezone’s offset, the meeting remains at 10 NY time on that date, it does not shift in NY time. It’s the UTC which shifts.

And UTC alone is sufficient for past events, as they are fixed instants in the time-stream. Unless you’re at a stage where you need to take relativistic effects into account, in which case you also need to record the reference frame.
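
For the future-meeting case, one possible sketch in Python: store the wall time and a zone identifier, and compute the UTC instant only when needed, using whatever rules are current at that point. `FutureEvent`, `resolve`, the date, and both offset tables are hypothetical, and plain dictionaries stand in for a real tz database.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass(frozen=True)
class FutureEvent:
    wall_time: datetime  # naive: what the local clock should read
    zone_id: str         # e.g. "America/New_York"

def resolve(event: FutureEvent, offsets: dict) -> datetime:
    """Compute the UTC instant from the offset rules known at lookup time."""
    local = event.wall_time.replace(tzinfo=timezone(offsets[event.zone_id]))
    return local.astimezone(timezone.utc)

meeting = FutureEvent(datetime(2030, 3, 4, 10, 0), "America/New_York")

rules_today = {"America/New_York": timedelta(hours=-5)}
rules_after_change = {"America/New_York": timedelta(hours=-4)}  # hypothetical legislative change

before = resolve(meeting, rules_today)        # 15:00 UTC
after = resolve(meeting, rules_after_change)  # 14:00 UTC
```

The stored record never changes; only the derived UTC instant moves when the rules do.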