by hcrisp 4292 days ago
Good news about the PySpark input format improvements. Does that also cover reading complex Parquet datatypes into SchemaRDDs with their native datatypes? When can we get a Databricks Cloud account (I'm already on the waiting list)?
2 comments

Yeah, you can load Parquet data directly into SchemaRDDs in 1.1 and get the type conversion, including use of nested types. The long-term solution for all of our storage integration is to go through the SchemaRDD API, since it's a standard type description and we expect many data sources to integrate there.
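
For anyone who wants to try the Parquet path, here is a minimal sketch, assuming the Spark 1.1 PySpark calls `SQLContext.parquetFile` and `SchemaRDD.registerTempTable`. The helper name, the `events.parquet` path, and the `events` table name are hypothetical, chosen only for illustration.

```python
def load_parquet_as_table(sql_context, path, table_name):
    """Load a Parquet file as a SchemaRDD and register it for SQL queries.

    sql_context is assumed to be a pyspark.sql.SQLContext (Spark 1.1 API).
    """
    # parquetFile takes the schema from the Parquet file itself,
    # so no separate type declaration is passed in here.
    schema_rdd = sql_context.parquetFile(path)
    # registerTempTable makes the SchemaRDD queryable via sql_context.sql(...).
    schema_rdd.registerTempTable(table_name)
    return schema_rdd


def main():
    # Imported here so the helper above can be read without Spark installed.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext("local", "parquet-example")
    sql_context = SQLContext(sc)
    # Hypothetical path and table name.
    events = load_parquet_as_table(sql_context, "events.parquet", "events")
    events.printSchema()  # prints the column types recovered from Parquet
    print(sql_context.sql("SELECT COUNT(*) FROM events").collect())
```

Calling `main()` requires a local Spark 1.1 install and an existing Parquet file; the helper only relies on the two methods named above.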

Re: Databricks Cloud - shoot me an e-mail and I'll see if I can help. Right now demand exceeds supply for us on accounts, but I can try!

Doesn't SchemaRDD already support Parquet? Although it'd be great if it supported CSVs.
There's work in progress to support importing CSV data as SchemaRDDs:

https://issues.apache.org/jira/browse/SPARK-2360
https://github.com/apache/spark/pull/1351