|
|
|
|
|
by rbabich
4981 days ago
|
|
There are a couple of (related) reasons. The first thing to recognize is that systems such as this one are designed for parallel workloads where all processes are running in lockstep, communicating via MPI with frequent barriers. This is very different from MapReduce and other asynchronous or "embarrassingly parallel" workloads where GFS, HDFS, etc. tend to be used. Distributed filesystems used in high-performance computing (such as Lustre, IBM's GPFS, etc.) also have to be able to handle both reads and writes with high throughput, whereas GFS is mostly optimized for reads and appends. Why not just install disks in the compute nodes and run Lustre there? Since all the nodes are working together in lockstep, system jitter is a major problem. Imagine that you have a job running across 10,000 nodes and 160,000 cores, and a process on one of those cores get preempted for a millisecond while a disk I/O request is being serviced. Everyone waits, and you've suddenly wasted 160 core-seconds. Now, if this happens only 1000 times per second across the whole machine, it's clear that you're not going to make much forward progress, and the whole system is going to run at very low efficiency. For this reason, Crays and similar large machines run a very minimal OS on the compute nodes (a linux-based "compute node kernel" in the case of Cray). Introducing local disks would go against the whole philosophy. There's also the issue of network contention. The network is typically the bottleneck, and you want to minimize the extent to which file I/O competes with your MPI traffic. As someone else mentioned, the solution is to have a dedicated storage system (often Lustre running on a semi-segregated cluster). This approach is used almost universally by the 500 systems on the Top 500 list (http://top500.org), for example. It's not just inertia :-). |
|
EDIT: I still dont buy it, but I will give some thought to the synchronous/lockstep nature of the environment.