by tma-1 3679 days ago
I have been using the DataFrame/SQL API extensively and I just love it. Most of the issues I have had stemmed from the cluster / Spark configuration and not the API itself. Using SQL is so much more intuitive than chaining multiple joins, selects, filters, etc. on an RDD.
1 comments

So I did find it useful for doing additional exploratory aggregations once the data was already cleaned and denormalized. My comment was more directed at the upfront initial data processing (in our case, extracting time series data out of a large number of files).

I did hit issues w/ multiple joins and shuffling though. Have you not hit issues w/ shuffling?

I was using Spark 1.5.1 for the record.

Have you tried tuning Spark's memory parameters?
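
For anyone wondering which parameters that means on a 1.5.x cluster, here is a rough sketch in Python that assembles a spark-submit command line. The property names are the Spark 1.x settings (the two memoryFraction ones belong to the legacy memory manager and were superseded by unified memory management in 1.6). The values, the `my_job.py` script name, and the `spark_submit_command` helper are illustrative assumptions, not recommendations; sensible numbers depend on the cluster and the workload.

```python
import shlex

# Assumed starting values for illustration only.
spark_conf = {
    "spark.executor.memory": "8g",          # heap per executor
    "spark.shuffle.memoryFraction": "0.4",  # Spark 1.x default: 0.2
    "spark.storage.memoryFraction": "0.4",  # Spark 1.x default: 0.6
    "spark.sql.shuffle.partitions": "400",  # default: 200
}


def spark_submit_command(app, conf):
    """Build a spark-submit command line with one --conf flag per setting."""
    args = ["spark-submit"]
    for key, value in sorted(conf.items()):
        args += ["--conf", "{}={}".format(key, value)]
    args.append(app)
    # Quote each argument so the result can be pasted into a shell.
    return " ".join(shlex.quote(a) for a in args)


print(spark_submit_command("my_job.py", spark_conf))
```

Giving shuffles a larger share of the heap and raising the number of shuffle partitions are the usual first things to try when multi-join queries spill or run out of memory, but the trade-off is less room for cached data.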