


	by hcrisp 4339 days ago
	Good news about the PySpark input format improvements. Does that also cover reading complex Parquet datatypes into SchemaRDDs with their native datatypes? When can we get a Databricks Cloud account (I'm already on the waiting list)?

2 comments

pwendell 4339 days ago

Yeah, you can load Parquet data directly into SchemaRDDs in 1.1 and get the type conversion, including use of nested types. The long-term solution for all of our storage integration is to go through the SchemaRDD API, since it's a standard type description and we expect many data sources to integrate there.

Re: Databricks Cloud - shoot me an e-mail and I'll see if I can help. Right now demand for accounts exceeds our supply, but I can try!
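
To illustrate the Parquet loading mentioned above, here is a minimal sketch, assuming the Spark 1.1-era PySpark calls `SQLContext.parquetFile`, `SchemaRDD.registerTempTable`, and `SchemaRDD.printSchema`. The helper `load_parquet`, the file name `events.parquet`, and the table name `events` are hypothetical and chosen only for the example.

```python
def load_parquet(sql_context, path, table_name):
    """Load a Parquet file as a SchemaRDD and register it for SQL queries.

    Assumes a Spark 1.1-style SQLContext; the schema (including nested
    types) is expected to come from the Parquet file itself.
    """
    schema_rdd = sql_context.parquetFile(path)
    schema_rdd.registerTempTable(table_name)
    return schema_rdd


def main():
    # Imports are kept local so the helper above can be read and tested
    # without a Spark installation.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext("local", "parquet-sketch")
    sql_context = SQLContext(sc)

    # Hypothetical path and table name.
    events = load_parquet(sql_context, "events.parquet", "events")
    events.printSchema()  # prints the column types read from the file
    print(sql_context.sql("SELECT COUNT(*) FROM events").collect())

    sc.stop()
```

Calling `main()` requires a local Spark installation and an existing Parquet file at the given path.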


ambrood 4339 days ago

Doesn't SchemaRDD already support Parquet? It'd be great if it supported CSVs, though.


JoshRosen 4339 days ago

There's work in progress to support importing CSV data as SchemaRDDs:

https://issues.apache.org/jira/browse/SPARK-2360
https://github.com/apache/spark/pull/1351
