| HN Mirror



	by kermatt 1044 days ago
	Moving between Spark and Pandas can cause type casting as well. For example the range of allowable dates in Pandas is much smaller than in Spark. We completely abandoned Pandas in favor of PySpark for this reason. It seems unnecessary to use multiple dataframe implementations when Spark is already in play.


smcin 1043 days ago

Are you referring to pandas.Timestamp.max being 2262-04-11 23:47:16.854775807 ?

https://pandas.pydata.org/docs/reference/api/pandas.Timestam...

(pandas design choice was to support nanosecond times, for financial data.)
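For anyone wondering where 2262 comes from, here is a rough sketch of the arithmetic, assuming the timestamp is stored as a signed 64-bit count of nanoseconds since 1970-01-01. It uses only the Python standard library, so it drops the last three digits (Python's datetime stops at microseconds); the variable names are just for illustration.

```python
from datetime import datetime, timedelta

# Largest value a signed 64-bit integer can hold, read as nanoseconds since the epoch.
INT64_MAX_NS = 2**63 - 1

# Python's datetime has microsecond resolution, so truncate nanoseconds to microseconds.
max_us = INT64_MAX_NS // 1000

# Add that offset to the Unix epoch to get the latest representable instant.
upper_bound = datetime(1970, 1, 1) + timedelta(microseconds=max_us)
print(upper_bound)  # 2262-04-11 23:47:16.854775
```

That matches the documented pandas.Timestamp.max up to the truncated nanosecond digits.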


kermatt 1041 days ago

Yes. Unfortunately I’m dealing with an app that likes to use multiple magic dates way past the Pandas range.


smcin 1040 days ago

"much smaller range" seems disingenuous without saying that you mean "not beyond 2262". And you said those aren't real dates, only magic dates or sentinels. So that's a totally artificial requirement. And you could fix the magic dates up at conversion with a simple replacement script.

* MS-DOS supports dates from 1/1/1980 to 12/31/2099

* 32b Linux (or Windows 7) supported timestamps up to 2038

* 64b timestamps fixed all this already, and presumably OSes will be using 128b datetimes well before 2099 if not sooner.
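A minimal sketch of what such a replacement step might look like, applied before the values reach pandas. The sentinel list and the function name are assumptions for illustration, not anything from the app being discussed, and only the upper bound is handled. Note that mapping a sentinel to None throws away whatever meaning the upstream attached to it.

```python
from datetime import date
from typing import Optional

# Hypothetical sentinel values; a real upstream would have its own list.
SENTINEL_DATES = {date(9999, 12, 31)}

# Last full day that fits inside the nanosecond Timestamp range (max is 2262-04-11 23:47:16).
LAST_SAFE_DATE = date(2262, 4, 10)


def clean_date(value: Optional[date]) -> Optional[date]:
    """Return None for sentinels and dates past the safe range, else the date unchanged."""
    if value is None or value in SENTINEL_DATES or value > LAST_SAFE_DATE:
        return None
    return value


rows = [date(2021, 3, 14), date(9999, 12, 31), None, date(2300, 1, 1)]
cleaned = [clean_date(d) for d in rows]
print(cleaned)
```

Only the first value survives here; the sentinel, the missing value, and the far-future date all come back as None.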


kermatt 1040 days ago

The RDBMS in this case accepts 9999-12-31 as a valid date. Pandas does not. This is where the issue came in, and switching to PySpark meant we needed no date manipulation to handle the data supplied by the upstream.

Magic dates suck, but they exist in the wild. There are also valid cases where data is not tied to the lifetimes of humans currently writing code.

The range of date values in PostgreSQL is 4713 BC to 5874897 AD:

https://www.postgresql.org/docs/current/datatype-datetime.ht...


smcin 1039 days ago

Ah I see your point. Yeah I noticed SQL goes up to 9999.
