by bdarfler
3892 days ago
There is always a balance here between CPU and IO. For a long time databases and big data platforms were pretty terrible with IO. However, as the computer engineering community has had time to work on these problems, we have gotten considerably better at understanding how to store data in sorted, compressed columnar formats and how to exploit data locality via segmentation and partitioning. As such, most well-constructed big data products are CPU bound at this point. For instance, check out the NSDI '15 paper on Spark performance, which found it was CPU bound. Vertica is also generally CPU bound. https://www.usenix.org/conference/nsdi15/technical-sessions/...

From the paper, the following three quotes highlight exactly why they were CPU bound:
> We found that if we instead ran queries on uncompressed data, most queries became I/O bound
> is an artifact of the decision to write Spark in Scala, which is based on Java: after being read from disk, data must be deserialized from a byte buffer to a Java object
> for some queries, as much as half of the CPU time is spent deserializing and decompressing data
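
To make the trade-off concrete, here is a rough sketch of how compression moves cost from IO to CPU. It is not from the paper: the `compression_tradeoff` helper and the synthetic column are hypothetical, and zlib only stands in for whatever codec a real columnar store would use. It measures how many bytes would hit disk with and without compression, and how much CPU time decompression costs.

```python
import time
import zlib


def compression_tradeoff(raw: bytes, level: int = 6) -> dict:
    """Compare bytes stored on disk against CPU time spent decompressing.

    Hypothetical helper for illustration; zlib is a stand-in codec.
    """
    compressed = zlib.compress(raw, level)

    # Time only the decompression step, which is the CPU cost paid on read.
    start = time.perf_counter()
    restored = zlib.decompress(compressed)
    decompress_seconds = time.perf_counter() - start

    assert restored == raw  # the round trip must be lossless
    return {
        "raw_bytes": len(raw),
        "compressed_bytes": len(compressed),
        "decompress_seconds": decompress_seconds,
    }


# Assumed data: a sorted, low-cardinality column, the kind of input
# that sorted columnar formats are designed to compress well.
column = b"".join(str(i // 1000).encode() + b"\n" for i in range(200_000))
stats = compression_tradeoff(column)
print(stats)
```

On data like this the compressed size should be a small fraction of the raw size, so far fewer bytes are read from disk, while the decompression time is pure CPU work. Actual numbers depend on the machine, the codec, and the data.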