|
|
|
|
|
by iskander
3695 days ago
|
|
In theory, Spark lets you seamlessly write parallel computations without sacrificing expressivity. You perform collections-oriented operations (e.g. flatMap, groupBy) and the computation gets magically distributed across a cluster (alongside all necessary data movement and failure recovery). In practice, Spark seems to perform reasonably well on smaller in-memory datasets and on some larger benchmarks under the control of Databricks. My experience has been pretty rough for legitimately large datasets (can't fit in RAM across a cluster) -- mysterious failures abound (often related to serialization, fat in-memory representations, and the JVM heap). The project has been slowly moving toward an improved architecture for working with larger datasets (see Tungsten and DataFrames), so hopefully this new release will actually deliver on the promise of Spark's simple API. |
|