
by milcron 3350 days ago

A great paper that covers different approaches to parallel computing is "Three layer cake for shared-memory programming" [0]. The authors divide parallel programming into three broad strategies:

1. SIMD (parallel lines)

2. Fork-Join (a directed acyclic graph of operations)

3. Message-Passing (a graph of operations)

GPUs are great at SIMD, but bad at the other sorts of parallelism.
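To make the three strategies concrete, here is a minimal sketch on CPU threads using only the Python standard library. The function names (`simd_style`, `fork_join_sum`, `message_passing_sum`) are made up for illustration and do not come from the paper; the list comprehension only stands in for a real vector instruction.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def simd_style(xs):
    # 1. SIMD / data-parallel: the same operation applied to every element.
    # A real SIMD unit would do this in lockstep; this is only the shape of it.
    return [x * x for x in xs]


def fork_join_sum(xs, n_tasks=4):
    # 2. Fork-join: split the work, fork child tasks, then join their results.
    chunk = max(1, len(xs) // n_tasks)
    parts = [xs[i:i + chunk] for i in range(0, len(xs), chunk)]
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(sum, part) for part in parts]  # fork
        return sum(f.result() for f in futures)               # join


def message_passing_sum(xs):
    # 3. Message passing: the worker shares no state with the caller and
    # communicates only through queues.
    inbox, outbox = queue.Queue(), queue.Queue()

    def worker():
        total = 0
        while True:
            msg = inbox.get()
            if msg is None:  # sentinel: no more messages
                break
            total += msg
        outbox.put(total)

    t = threading.Thread(target=worker)
    t.start()
    for x in xs:
        inbox.put(x)
    inbox.put(None)
    t.join()
    return outbox.get()
```

The fork-join version forks only one level deep on purpose: recursively submitting to a bounded thread pool and blocking on the children can deadlock.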

[0] https://www.researchgate.net/publication/228683178_Three_lay...

2 comments

jhj 3350 days ago

You can express some forms of #2 or #3 well on a GPU. It depends upon how wide the graph of tasks is (maximum number of concurrent tasks possible in the graph).

On Nvidia GPUs, 16 to 32 warps per SM x 60 SMs on a P100 gives a lot of hardware threads (1 thread == 1 warp) in flight at once; these are allowed to branch completely independently of each other (I forget the maximum occupancy of a P100's SM in warps at lowest resource use). Furthermore, you can use global memory atomics and spin-locks for event-driven programming, work-stealing, etc. This kind of thing is used in, e.g., persistent kernels. Of course, the single kernel that is being run must handle all of the code for all of the tasks. Not easy to write, but possible.
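As a rough CPU-side analogy (not GPU code), the persistent-kernel pattern looks something like the sketch below: long-lived workers repeatedly claim the next task index from a shared counter, and one worker body has to handle every task type. The lock stands in for a global-memory atomic add, and the task kinds `"square"` and `"negate"` are hypothetical placeholders.

```python
import threading


def run_persistent_workers(tasks, n_workers=4):
    """tasks is a list of (kind, arg) pairs; returns results in task order."""
    next_index = 0
    lock = threading.Lock()  # stands in for an atomic increment in global memory
    results = [None] * len(tasks)

    def worker():
        nonlocal next_index
        while True:
            # Claim the next unprocessed task index.
            with lock:
                i = next_index
                next_index += 1
            if i >= len(tasks):
                return  # queue drained; this worker exits
            kind, arg = tasks[i]
            # A single worker body must contain the code for all task types.
            if kind == "square":
                results[i] = arg * arg
            elif kind == "negate":
                results[i] = -arg
            else:
                raise ValueError(f"unknown task kind: {kind}")

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

For example, `run_persistent_workers([("square", 3), ("negate", 4)])` returns `[9, -4]` regardless of which worker picks up which task.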

scott_s 3350 days ago

This is a good paper, but not quite how I think about it. I use the terms data-parallel (for SIMD), task-parallel (for fork-join; kinda) and message passing. GPUs are basically data-parallel machines, but over the years, GPUs have been getting more and more capable, so I imagine some people out there are using them for task-parallel workloads.

claytonjy 3350 days ago

Would TensorFlow (or similar) count as task-parallel because the computation graph is a DAG? If so, there's a pretty popular example of task-parallelism running on GPUs.

chubot 3350 days ago

I would say TensorFlow is a hybrid of two strategies: SIMD and dataflow/DAG. (I wouldn't say fork-join and dataflow/DAG are synonymous; rather they are related but different models/APIs).

At the level of a single node, TensorFlow uses Eigen [1]. Eigen is like BLAS, but it's a C++ template library rather than Fortran. It compiles to various flavors of SIMD. Nvidia's proprietary CUDA is the SIMD flavor most commonly used by TensorFlow programs.

At the level of multiple nodes, TensorFlow derives a program graph from your Python source code, using high level "ops", in the style of NumPy. Then it distributes the ops across a cluster using a scheduler:

Quote: Its dataflow scheduler, which is the component that chooses the next node to execute, uses the same basic algorithm as Dryad, Flume, CIEL, and Spark. [2]

Python is the "control plane" and not the "data plane" -- it describes the logic and dataflow of the program, but doesn't touch the actual data. When you use NumPy, the C code and BLAS code are the data plane. When you use TensorFlow, Eigen and the gRPC/protobuf distribution layer are the data plane.

So you can have a big data dataflow system WITHOUT SIMD, like the four systems mentioned in the quote. And you can have SIMD without dataflow, i.e. if you are doing it in pure Eigen or procedural/functional R/Matlab/Julia on a single machine. Languages like R and Julia may have dataflow extensions, but they're single-threaded/procedural by default as far as I know.

A mathematical way to think of the DAG model is that your program imposes a partial order on computations rather than a total order (the procedural model) -- this is what gives you parallelism.
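To illustrate the partial-order idea, here is a minimal sketch that groups a dependency graph into "waves": every task within a wave is unordered relative to the others, so they could all run at the same time. The function name `parallel_waves` and the op names in the example are made up for illustration; this is not how TensorFlow's scheduler is implemented. It assumes every dependency also appears as a key in `deps`.

```python
def parallel_waves(deps):
    """deps maps each task to the set of tasks it depends on.

    Returns a list of waves; tasks in the same wave have no ordering
    constraint between them.
    """
    remaining = {task: set(d) for task, d in deps.items()}
    waves = []
    while remaining:
        # Tasks with no unmet dependencies are ready to run now.
        ready = sorted(t for t, d in remaining.items() if not d)
        if not ready:
            raise ValueError("cycle or missing task: not a valid DAG")
        waves.append(ready)
        for t in ready:
            del remaining[t]
        for d in remaining.values():
            d.difference_update(ready)
    return waves


deps = {
    "load_a": set(),
    "load_b": set(),
    "bias": set(),
    "matmul": {"load_a", "load_b"},
    "add": {"matmul", "bias"},
}
waves = parallel_waves(deps)
# waves == [["bias", "load_a", "load_b"], ["matmul"], ["add"]]
```

A total order would run these five ops one after another; the partial order only requires three sequential steps, and the first step has three independent ops.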

So TensorFlow uses both SIMD and dataflow.

[1] http://eigen.tuxfamily.org/index.php?title=Main_Page

[2] http://download.tensorflow.org/paper/whitepaper2015.pdf

scott_s 3350 days ago

Good point! Which reminds me that I left off pipeline-parallelism, which is very common in dataflow programming models. And TensorFlow is a dataflow model. But I think that the core computation in TensorFlow programs will tend to be largely data-parallel. That is, I think such programs tend to have a bunch of data-parallel computational kernels connected in a DAG. When I made that comment, I was thinking more of a Cilk-style program.

(I work on a dataflow language and system.)
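For readers unfamiliar with pipeline-parallelism, a minimal sketch under simple assumptions: each stage is one thread, stages are connected by queues, and different items can be in different stages at the same time. The names `stage` and `run_pipeline` are hypothetical and not taken from any particular dataflow system.

```python
import queue
import threading

_DONE = object()  # sentinel that flows through the pipeline to shut it down


def stage(fn, inbox, outbox):
    """Start a thread that applies fn to each item from inbox and forwards it."""
    def run():
        while True:
            item = inbox.get()
            if item is _DONE:
                outbox.put(_DONE)  # pass the shutdown signal downstream
                return
            outbox.put(fn(item))

    t = threading.Thread(target=run)
    t.start()
    return t


def run_pipeline(items, fns):
    """Push items through the stages in fns; returns outputs in input order."""
    queues = [queue.Queue() for _ in range(len(fns) + 1)]
    threads = [stage(fn, queues[i], queues[i + 1]) for i, fn in enumerate(fns)]
    for item in items:
        queues[0].put(item)
    queues[0].put(_DONE)

    results = []
    while True:
        out = queues[-1].get()
        if out is _DONE:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results
```

For example, `run_pipeline([1, 2, 3], [lambda x: x + 1, lambda x: x * 2])` returns `[4, 6, 8]`. Order is preserved here only because each stage is a single thread reading a FIFO queue; in a real system each stage would typically be a data-parallel kernel rather than a scalar function.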

mining 3350 days ago

I'd say no, since the purpose of the GPU there is to make the matmul really fast.
