by panarky
4276 days ago
The 100 terabyte benchmark used 206 Spark nodes, compared with 2100 Hadoop nodes. Going up to 1 petabyte, the Hadoop comparison adds more nodes, 3800, while the Spark benchmark actually reduced the number of nodes to 190. Does Spark scale well beyond ~200 nodes, or does the network become the bottleneck? In any case, it's an impressive result considering that they didn't use Spark's in-memory cache.
> [O]ur Spark cluster was able to sustain ... 1.1 GB/s/node network activity during the reduce phase, saturating the 10Gbps link available on these machines.
If the network is the bottleneck, it makes sense to use fewer nodes and so cut down on network communication.
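
For what it's worth, here is a quick back-of-the-envelope check of the quoted figure. It assumes decimal units (1 GB = 10^9 bytes, 1 Gbps = 10^9 bits/s), which the post doesn't spell out, and `link_utilization` is just a helper name made up for this sketch.

```python
# Rough check of the quoted 1.1 GB/s/node against a 10 Gbps link.
# Assumption: decimal units (1 GB = 10^9 bytes, 1 Gbps = 10^9 bits/s).

def link_utilization(gigabytes_per_sec: float, link_gbps: float) -> float:
    """Return the fraction of the link's nominal rate used by the throughput."""
    return gigabytes_per_sec * 8 / link_gbps  # 8 bits per byte


util = link_utilization(1.1, 10.0)
print(f"{1.1 * 8:.1f} Gbps, {util:.0%} of the 10 Gbps link")
# prints "8.8 Gbps, 88% of the 10 Gbps link"
```

So the quoted rate works out to roughly 88% of the nominal line rate, which is about what you'd expect from a saturated link once protocol overhead is accounted for.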