| HN Mirror



	by tma-1 3679 days ago
	I have been using the DataFrame/SQL API extensively and I just love it. Most of the issues I have had stemmed from the cluster / Spark configuration and not the API itself. Using SQL is so much more intuitive than chaining multiple joins, selects, filters, etc. on an RDD.

1 comments

mastratton3 3679 days ago

So I did find it useful for doing additional exploratory aggregations once the data was already cleaned and denormalized. My comment was more directed at the upfront initial data processing (in our case, extracting time series data from a large number of files).

I did hit issues w/ multiple joins and shuffling though. Have you not hit issues w/ shuffling?

I was using Spark 1.5.1 for the record.


tma-1 3679 days ago

Have you tried tuning Spark's memory parameters?
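
For anyone following along on Spark 1.5.x, here is a rough sketch of the kind of settings that question is about. The keys are the legacy (pre-1.6) memory and shuffle options; the values, the `SHUFFLE_HEAVY_CONF` name, the `to_submit_args` helper, and `job.py` are all made up for illustration, so treat this as a starting point to experiment from rather than a recommendation.

```python
# Legacy memory/shuffle settings that apply to Spark 1.5.x.
# The values below are illustrative assumptions, not tuned recommendations.
SHUFFLE_HEAVY_CONF = {
    "spark.executor.memory": "8g",
    "spark.shuffle.memoryFraction": "0.4",  # Spark 1.5 default: 0.2
    "spark.storage.memoryFraction": "0.4",  # Spark 1.5 default: 0.6
    "spark.sql.shuffle.partitions": "400",  # Spark 1.5 default: 200
}


def to_submit_args(conf):
    """Render a settings dict as a list of spark-submit --conf arguments."""
    args = []
    for key in sorted(conf):
        args.extend(["--conf", "{}={}".format(key, conf[key])])
    return args


if __name__ == "__main__":
    # job.py is a hypothetical application script.
    print(" ".join(["spark-submit"] + to_submit_args(SHUFFLE_HEAVY_CONF) + ["job.py"]))
```

The idea is to shift heap from the cache toward shuffle buffers for join-heavy jobs, and to raise the number of shuffle partitions so each one is smaller. Whether that helps depends on the workload.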
