| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by rbabich 4981 days ago

There are a couple of (related) reasons. The first thing to recognize is that systems such as this one are designed for parallel workloads where all processes are running in lockstep, communicating via MPI with frequent barriers. This is very different from MapReduce and other asynchronous or "embarrassingly parallel" workloads where GFS, HDFS, etc. tend to be used. Distributed filesystems used in high-performance computing (such as Lustre, IBM's GPFS, etc.) also have to be able to handle both reads and writes with high throughput, whereas GFS is mostly optimized for reads and appends.

Why not just install disks in the compute nodes and run Lustre there? Since all the nodes are working together in lockstep, system jitter is a major problem. Imagine that you have a job running across 10,000 nodes and 160,000 cores, and a process on one of those cores get preempted for a millisecond while a disk I/O request is being serviced. Everyone waits, and you've suddenly wasted 160 core-seconds. Now, if this happens only 1000 times per second across the whole machine, it's clear that you're not going to make much forward progress, and the whole system is going to run at very low efficiency. For this reason, Crays and similar large machines run a very minimal OS on the compute nodes (a linux-based "compute node kernel" in the case of Cray). Introducing local disks would go against the whole philosophy.

There's also the issue of network contention. The network is typically the bottleneck, and you want to minimize the extent to which file I/O competes with your MPI traffic.

As someone else mentioned, the solution is to have a dedicated storage system (often Lustre running on a semi-segregated cluster). This approach is used almost universally by the 500 systems on the Top 500 list (http://top500.org), for example. It's not just inertia :-).

1 comments

paulsutter 4981 days ago

Disk io has negligible CPU overhead. Preempted for a millisecond? A millisecond is millions of instructions. You're off by orders of magnitude. No matter where the disk is located, the disk io has to go across the network. If network capacity truly is the bottleneck, you have a different design problem and you cant exploit the CPUs.

EDIT: I still dont buy it, but I will give some thought to the synchronous/lockstep nature of the environment.

link

rbabich 4981 days ago

That was a straw man (sorry). There's also the overhead of maintaining consistency, synchronizing metadata, etc. I don't think assuming 0.1% CPU overhead for Lustre is a terrible estimate, but even if it were much lower, the argument would still hold (at least at the scale of Titan).

link