Hacker News new | ask | show | jobs
by jey 4402 days ago
> Maybe Spark doesn't like 100x growth in the size of an RDD using flatMap?

I'd be interested to hear more about your use case and the problems you encountered. It's possible that you need to do some kind of .coalesce() operation to rebalance the partitions if you have unbalanced partition sizes.

1 comments

Well, the RDD is initially partitioned using a RangePartitioner over a dense key space of Longs. Each element is then expanded ~100x (each object is significantly smaller than the original value). So the total memory footprint and skew of the expanded RDD shouldn't, theoretically, be a problem.