by hvidgaard
3875 days ago
After skimming the paper, I'm fairly confident it's not the same at all. We only managed the theoretical side of a scenario where there would be multiple TB hard drives on multiple machines. Any efficient algorithm would work in a scanning manner and not seek backwards beyond what could be kept in RAM. We did simulate this, and the result was quite clear: IO matters. From the paper, the following 3 quotes highlight exactly why they were CPU bound:

> We found that if we instead ran queries on uncompressed data, most queries became I/O bound

> is an artifact of the decision to write Spark in Scala, which is based on Java: after being read from disk, data must be deserialized from a byte buffer to a Java object

> for some queries, as much as half of the CPU time is spent deserializing and decompressing data