
After skimming the paper, I'm fairly confident it's not the same at all. We only worked out the theoretical side of a scenario with multiple TB hard drives spread across multiple machines. Any efficient algorithm would have to work in a scanning manner and never seek backwards beyond what could be kept in RAM. We did simulate this, and the result was quite clear: IO matters.
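To make "scanning manner" concrete, here is a rough Python sketch of what I mean, not the code we actually simulated with. The function name `scan_forward` and the parameters `chunk_size` and `window_chunks` are made up for illustration. The stream is read strictly front to back, and the only look-back available is a bounded window held in RAM.

```python
import io
from collections import deque


def scan_forward(stream, chunk_size=1 << 20, window_chunks=4):
    """Read `stream` front to back, keeping only the most recent
    `window_chunks` chunks in memory for any look-back.

    Returns (total bytes read, bytes still held in the window).
    """
    # Bounded look-back buffer: older chunks are dropped automatically.
    window = deque(maxlen=window_chunks)
    total = 0
    while True:
        chunk = stream.read(chunk_size)  # sequential read, never a seek
        if not chunk:
            break
        window.append(chunk)
        total += len(chunk)
    return total, sum(len(c) for c in window)


# Tiny in-memory stand-in for a multi-TB drive.
total, held = scan_forward(io.BytesIO(b"x" * 10), chunk_size=3, window_chunks=2)
```

Memory use stays at roughly `chunk_size * window_chunks` no matter how large the input is, which is the property that matters when the data does not fit in RAM.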

From the paper, the following three quotes highlight exactly why they were CPU bound:

> We found that if we instead ran queries on uncompressed data, most queries became I/O bound

> is an artifact of the decision to write Spark in Scala, which is based on Java: after being read from disk, data must be deserialized from a byte buffer to a Java object

> for some queries, as much as half of the CPU time is spent deserializing and decompressing data
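If you want to see the effect those quotes describe on your own machine, a rough sketch like the one below splits the CPU time for decoding into decompression and deserialization. This is Python with `zlib` and `pickle`, not Spark or the JVM, so the numbers will not match the paper; the helper name `decode_with_timing` and the sample records are my own assumptions.

```python
import pickle
import time
import zlib


def decode_with_timing(blob):
    """Decompress and deserialize `blob`, returning the records and
    the CPU seconds spent in each of the two steps."""
    t0 = time.process_time()
    raw = zlib.decompress(blob)      # step 1: decompress the byte buffer
    t1 = time.process_time()
    records = pickle.loads(raw)      # step 2: bytes -> in-memory objects
    t2 = time.process_time()
    return records, {"decompress_s": t1 - t0, "deserialize_s": t2 - t1}


# Hypothetical sample data standing in for rows read from disk.
records = [{"id": i, "name": f"row-{i}"} for i in range(100_000)]
blob = zlib.compress(pickle.dumps(records))

decoded, timings = decode_with_timing(blob)
print(timings)
```

Both timings are pure CPU work done after the bytes have already arrived from disk, which is why a workload can look CPU bound even when it reads a lot of data.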