| HN Mirror



	by jupiter90000 3458 days ago
	Different example: a simple 'group by' Spark SQL query on only about 20 million rows in a distributed Phoenix/HBase table couldn't even complete, because Spark dumbly shuffled all the data around the cluster. The Spark/Phoenix RDD driver apparently had no 'group by' pushdown support for Phoenix, so it shuffled all the data, amazingly inefficiently. Running the same query directly on Phoenix took about a minute. My point is, these 'on a laptop/single machine memory' examples don't really give me an indicator of scenarios where I might actually want to use Spark, etc.

2 comments

j-m-o 3457 days ago

Hey, I'm the phoenix-spark author here. You're totally right: at the moment there is a lot of dumb shuffling around for certain operations. Hopefully some of that will get fixed up in the next release [1].

[1] https://issues.apache.org/jira/browse/PHOENIX-3600


dandermotj 3458 days ago

You're trying to GROUP BY on a distributed data store; your code is the problem, not Spark SQL. Use CLUSTER BY, its distributed sibling.

Query languages like HiveQL and Spark SQL were designed to look like SQL, but they're not.


jupiter90000 3457 days ago

Edit: correct me if I'm wrong, but it doesn't appear that 'cluster by' avoids a costly shuffle first. I'd rather just push down to the database engine when using a database engine. Group by worked fine on Phoenix, so saying my code is the problem means it's really only a problem when using Spark SQL with the Phoenix RDD driver.
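
To make the pushdown point concrete, here is a toy sketch in plain Python. It is not the Spark or Phoenix API, and the function names and sample data are made up for illustration. It only counts how many rows have to leave the storage side when grouping happens after the transfer versus before it.

```python
from collections import Counter

# Toy model only: each inner list stands for one storage partition
# holding (key, value) rows. Names and data are hypothetical.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("b", 5), ("c", 6)],
]


def shuffle_then_group(parts):
    """No pushdown: every raw row is moved, then summed per key."""
    moved = 0
    totals = Counter()
    for part in parts:
        for key, value in part:
            moved += 1            # one raw row crosses the network
            totals[key] += value
    return dict(totals), moved


def group_then_merge(parts):
    """Pushdown: each partition sums locally, then ships one row per key."""
    moved = 0
    totals = Counter()
    for part in parts:
        local = Counter()
        for key, value in part:
            local[key] += value   # aggregated where the data lives
        moved += len(local)       # only the partial sums are moved
        totals.update(local)      # Counter.update adds the partial sums
    return dict(totals), moved


no_pushdown = shuffle_then_group(partitions)
with_pushdown = group_then_merge(partitions)
print(no_pushdown)
print(with_pushdown)
```

Both functions return the same per-key sums; only the moved-row count differs, and the gap grows with the number of rows per key.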
