|
|
|
|
|
by anonymousDan
4071 days ago
|
|
I'm not sure I completely buy their claim that network/disk IO is no longer the bottleneck in many situations. Their recent NSDI paper (https://kayousterhout.github.io/trace-analysis/) is compute bound for a 20 node cluster, but they never evaluate it in a larger cluster with fixed data size. The availability of on-demand cloud resources would potentially make it easier to solve the bottleneck by just increasing the cluster size. |
|
Given a set of cheap SSDs and a good I/O scheduler, very few workloads should bottleneck on disk because the disk has more available bandwidth than the network. If you are bottlenecking on disk, it usually means the I/O scheduling is poor. The operating system I/O scheduler is quite poor for many use cases, hence why I/O intensive apps write their own.
Network can be inexpensively and easily saturated at 10 GbE these days, even when doing something ugly like parsing JSON streams. Unless there is a bottleneck elsewhere in the system, like memory bandwidth or occasionally CPU, this is where I typically see bottlenecks in real systems. However, switch fabrics do not scale infinitely even if you know what you are doing, so there is that, and bandwidth does not always scale linearly (though things like graph analysis are much closer to effectively linear in sophisticated implementations than you see using naive algorithms).
Data motion is always the bottleneck. Right now, moving from RAM to CPU or from machine to machine is almost always the bottleneck if you are doing everything else right. Many popular software designs and implementations for "big data" have many other bottlenecks that are not strictly necessary, so that throws off expectations e.g. poor I/O scheduling saturating disk or poor JSON parsing burning up CPU or poor use of threading wasting a lot of CPU time.