Hacker News new | ask | show | jobs
by massel 3347 days ago
Possibly a dumb question, but if it's this embarrassingly parallel, wouldn't this be a workload more suited for calculation on a GPU? I'm assuming there's a good reason he's not using one, so could someone who understands this a little better explain it?
3 comments

> if it's this embarrassingly parallel, wouldn't this be a workload more suited for calculation on a GPU?

Maybe, but one does not necessarily follow from the other. Consider the task of compiling 1 million separate C++ projects. That is obviously embarrassingly parallel, but not well suited for a GPU. It's trivial to do many compilations at once, but compiling itself is not easy to parallelize.

That example is obviously contrived, but I think it demonstrates the principle that it's the computational profile of the core problem that will determine if you can use a GPU. If the core problem requires 10s of GB of RAM, or it's excessively branchy code, it may not be well suited for a GPU.

A great paper which delves into different approaches for parallel computing is "Three layer cake for shared-memory programming" [0]. They characterize parallel programming into three broad strategies:

1. SIMD (parallel lines)

2. Fork-Join (a directed acyclic graph of operations)

3. Message-Passing (a graph of operations)

GPUs are great at SIMD, but bad at the other sorts of parallelism.

[0] https://www.researchgate.net/publication/228683178_Three_lay...

You can express some forms of #2 or #3 well on a GPU. It depends upon how wide the graph of tasks is (maximum number of concurrent tasks possible in the graph).

On Nvidia GPUs, 16 to 32 warps per SM x 60 SMs on a P100 gives a lot of hardware threads (1 thread == 1 warp) in flight at once; these are allowed to branch completely independent of each other (I forget the maximum occupancy of a P100's SM in warps at lowest resource use). Furthermore, you can use global memory atomics and spin-locks for event driven programming, work-stealing, etc. This kind of stuff is used in, e.g., persistent kernels. Of course, the single kernel that is being run must handle all of the code for all of the tasks. Not easy to write, but possible.

This is a good paper, but not quite how I think about it. I use the terms data-parallel (for SIMD), task-parallel (for fork-join; kinda) and message passing. GPUs are basically data-parallel machines, but over the years, GPUs have been getting more and more capable, so I imagine some people out there are using them for task-parallel workloads.
Would tensorflow (or similar) count as task-parallel because the computation graph is a DAG? If so, there's a pretty popular example of task-parallelism running on GPU's.
I would say TensorFlow is a hybrid of two strategies: SIMD and dataflow/DAG. (I wouldn't say fork-join and dataflow/DAG are synonymous; rather they are related but different models/APIs).

At the level of a single node, TensorFlow uses Eigen [1]. Eigen is like BLAS, but it's a C++ template library rather than Fortran. It compiles to various flavors of SIMD. Nvidia's proprietary CUDA is the SIMD flavor most commonly used by TensorFlow programs.

At the level of multiple nodes, TensorFlow derives a program graph from your Python source code, using high level "ops", in the style of NumPy. Then it distributes the ops across a cluster using a scheduler:

Quote: Its dataflow scheduler, which is the component that chooses the next node to execute, uses the same basic algorithm as Dryad, Flume, CIEL, and Spark. [2]

Python is the "control plane" and not the "data plane" -- it describes the logic and dataflow of the program, but doesn't touch the actual data. When you use NumPy, the C code and BLAS code are the data plane. When you use TensorFlow, the Eigen and GRPC/protobuf distribution layer are the data plane.

So you can have a big data dataflow system WITHOUT SIMD, like the four systems mentioned in the quote. And you can have SIMD without dataflow, i.e. if you are doing it in pure Eigen or procedural/functional R/Matlab/Julia on a single machine. Languages like R and Julia may have dataflow extensions, but they're single-threaded/procedural by default as far as I know.

A mathematical way to think of the DAG model is where you program uses a partial order on computations rather than a total order (the procedural model) -- this is what give gives you parallelism.

So TensorFlow uses both SIMD and dataflow.

[1] http://eigen.tuxfamily.org/index.php?title=Main_Page

[2] http://download.tensorflow.org/paper/whitepaper2015.pdf

Good point! Which reminds me that I left off pipeline-parallelism, which is very common in dataflow programming models. And Tensorflow is a dataflow model. But I think that the core computation in Tensorflow programs will tend to be largely data-parallel affairs. That is, I think such programs tend to have a bunch of data-parallel computational kernels connected in a DAG. When I made that comment, I was thinking more of a Cilk style program.

(I work on a dataflow language and system.)

I'd say no, since the purpose of the GPU there is to make the matmul really fast.
That's not a dumb question. As alluded to below, for most people the main challenge is their software stack. Beyond that, 200k+ vcpus is still a ton so you'd need hundreds to thousands of GPUs to just match the flops (Note: I don't know if Drew's code is vectorised or not on CPUs).

Put together that by using Preemptible VMs (and yes, apologies we still don't offer GPUs as preemptible) it's economically rational to use spare CPUs.

Disclosure: I work on Google Cloud.

GPU's are significantly more expensive on GCE, could just be a cost thing.