Hacker News new | ask | show | jobs
by hvidgaard 3888 days ago
The major takeaway I had from my courses in data intensive applications, was that IO is all that matters. It is the limiting factor to such an extend that you don't really care about the algorithmic efficiency with regards to CPU calculations, or memory.

You analyse algorithms in terms of IO access, and specifically access pattern. If you cannot make the algorithm in a scanning fashion, you're in for a bad time.

1 comments

There is always a balance here between CPU and IO. For a long time databases and big data platforms were pretty terrible with IO. However, as the computer engineering community has had time to work with these problems we have gotten considerably better at understanding how to store data via sorted and compressed columnar formats how to exploit data locality via segmentation and partitioning. As such most well constructed big data products are CPU bound at this point. For instance check out the NSDI `15 paper on Spark performance that found it was CPU bound. Vertica is also generally CPU bound.

https://www.usenix.org/conference/nsdi15/technical-sessions/...

After skimming the paper, I'm fairly confident it's not the same at all. We only managed the theoretical side of a scenario where there would be multiple TB hard drives, on multiple machines. Any efficient algorithm would work in a scanning manner, and not seek backwards beyond what could be kept in ram. We did simulate this, and the result was quite clear, IO matters.

From the paper the following 3 quotes highlight exactly why they where CPU bound:

> We found that if we instead ran queries on uncompressed data, most queries became I/O bound

> is an artifact of the decision to write Spark in Scala, which is based on Java: after being read from disk, data must be deserialized from a byte buffer to a Java object

> for some queries, as much as half of the CPU time is spent deserializing and decompressing data