|
|
|
|
|
by ignoreusernames
516 days ago
|
|
just out of curiosity, why do you say that spark is "in-memory"? I see a lot people claiming that, including several that I've interviewed in the past few years but that's not very accurate(at least in the default case). Spark SQL execution uses a bog standard volcano-ish iterator model (with a pretty shitty codegen operator merging part) built on top of their RDD engine. The exchange (shuffle) is disk based by default (both for sql queries and lower level RDD code), unless you mount the shuffle directory in a ramdisk I would say that spark is disk based. You can try it out on spark shell: spark.sql("SELECT explode(sequence(0, 10000))").write.parquet("sample_data")
spark.read.parquet("sample_data").groupBy($"col").count().count()
after running the code, you should see a /tmp/blockmgr-{uuid} directory that holds the exchange data. |
|