by ap22213
3578 days ago
One of the hardest parts for me was sizing the cluster so that all the data stayed in memory. Overflowing to disk slows things down a lot. But sizing the cluster can be tricky if your tasks generate a lot of data structures. I only use RDDs, so I put a lot of thought into the processing flow to keep unnecessary data from being shuffled. If you're reducing PairRDDs, make sure the data is evenly distributed. Also, I'm guessing you've read the optimization docs, but you can cut a huge amount of network I/O by choosing the right types and collections and by optimizing serialization. And, of course, group within partitions first, then within nodes, then across nodes. And go for fewer, bigger servers with lots of network bandwidth. There are a lot of tricks, unfortunately. Since I don't know your experience level, I won't bore you with things you probably already know.
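To make the "group within partitions first" point concrete, here is a rough sketch in plain Python (no Spark involved, so it only simulates the idea). The function names, the round-robin partitioning, and the sample data are all hypothetical and chosen for illustration. It counts how many records would cross the network if every pair is shuffled as-is, versus summing per key inside each partition first and shipping only the partial sums.

```python
from collections import Counter

def partition(records, num_partitions):
    """Split records round-robin into partitions (a stand-in for input splits)."""
    parts = [[] for _ in range(num_partitions)]
    for i, rec in enumerate(records):
        parts[i % num_partitions].append(rec)
    return parts

def shuffle_then_reduce(parts):
    """Naive approach: ship every (key, value) record, then sum per key."""
    shuffled = [rec for part in parts for rec in part]
    totals = Counter()
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)

def combine_then_reduce(parts):
    """Sum per key inside each partition first, then ship only the partial sums."""
    partials = []
    for part in parts:
        local = Counter()
        for key, value in part:
            local[key] += value
        partials.extend(local.items())
    totals = Counter()
    for key, value in partials:
        totals[key] += value
    return dict(totals), len(partials)

# Hypothetical data: 1,000 records over 3 keys, spread across 4 partitions.
records = [("key%d" % (i % 3), 1) for i in range(1000)]
parts = partition(records, 4)
naive_totals, naive_shuffled = shuffle_then_reduce(parts)
combined_totals, combined_shuffled = combine_then_reduce(parts)
```

Both approaches produce the same totals, but the second ships one record per key per partition instead of one per input record. In Spark terms this is roughly the difference between `groupByKey` followed by a sum and `reduceByKey`, which combines on the map side before the shuffle. The real savings depend on how many distinct keys each partition holds.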
Is it only applicable once the "cluster [is] sized appropriately so that all data stayed in memory" as you mention?