by dekhn 1180 days ago
A few people have built frameworks to do this.

There is still a very large open problem in how to federate large numbers of loosely coupled computers to speed up training "interesting" models. I've worked in both domains (protein folding via Folding@home and on supercomputers, and ML training on single nodes and on supercomputers) and, at least so far, ML hasn't really been a good match for embarrassingly parallel compute. Even in protein folding, Folding@home has a number of limitations that are much better addressed on supercomputers (for example, if your problem requires making extremely long individual simulations of large proteins).

All that could change, but I think for the time being, interesting/big models need to be trained on tightly coupled GPUs.


And you can rule out most of the Monte Carlo stuff too. That rules out parallelizing modern statistical frameworks like STAN, which are used for explainable models; things like financial risk modeling, which samples posteriors using MCMC, also can't be parallelized.
Assuming the chains can reach an equilibrium point (i.e. get through burn-in) quickly, drawing M samples with MCMC can be parallelized by running N chains in parallel, each for M/N iterations after burn-in. You still end up with M total samples from your target distribution.

You’re only out of luck if each iteration is too compute-intensive to fit on one worker node. Even if the work inside each iteration is embarrassingly parallelizable, the overhead of having to aggregate computations across workers at every iteration would be too high.
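
To make the N-chains-for-M/N-iterations idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the target is a standard normal, the sampler is a basic random-walk Metropolis, and the names `run_chain` and `sample_parallel` are made up here, not taken from STAN or any other framework. Threads are used only to show the structure; for real speedup you would hand the same jobs to separate processes or machines.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def log_target(x):
    # Unnormalized log-density of a standard normal (illustrative target only).
    return -0.5 * x * x


def run_chain(job):
    """Run one independent random-walk Metropolis chain with its own seed."""
    seed, n_keep, burn_in = job
    rng = random.Random(seed)
    x = 0.0
    samples = []
    for i in range(burn_in + n_keep):
        proposal = x + rng.gauss(0.0, 1.0)
        # Accept with probability min(1, p(proposal) / p(x)).
        # 1.0 - rng.random() lies in (0, 1], so the log is always defined.
        if math.log(1.0 - rng.random()) < log_target(proposal) - log_target(x):
            x = proposal
        if i >= burn_in:
            samples.append(x)  # keep only post-burn-in draws
    return samples


def sample_parallel(m_total, n_chains, burn_in=500, map_fn=map):
    """Draw m_total samples as n_chains chains of m_total // n_chains each.

    Assumes m_total is divisible by n_chains. The chains never communicate,
    so map_fn can be any parallel map (for example an executor's .map).
    """
    per_chain = m_total // n_chains
    jobs = [(seed, per_chain, burn_in) for seed in range(n_chains)]
    chains = list(map_fn(run_chain, jobs))
    return [x for chain in chains for x in chain]


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as pool:
        draws = sample_parallel(8000, 4, map_fn=pool.map)
    print(len(draws), sum(draws) / len(draws))
```

Note that every chain pays its own burn-in, so total work is N * burn_in + M iterations rather than burn_in + M. That is why the approach only pays off when burn-in is short relative to M/N.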

In reality the number of chains is not that high though, right? I've seen 3-4 chains in STAN models do the job on most smallish (econ, social sciences) datasets, though I may be wrong about other domains...
Probably going to mirror the transition from single-threaded to multi-threaded compute. It took a while for application architectures that utilize multi-core to take hold among the populace.
Probably not. Multicore has been a thing for 30 years (we had a 32-core Sequent Systems machine and a 64-core KSR-1 at UW CS&E in the early 1990s). Everything about these models has been developed in a multicore computing context, and thus far they still aren't massively-parallel-distributable. An algorithm can be massively parallel without being sensibly distributable. Changing the latency between compute nodes does not always produce a neutral, or even just a linear, decrease in performance.
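
As a rough illustration of the latency point, here is a toy cost model, not a measurement. It assumes each training step splits perfectly across workers and then pays a fixed synchronization cost proportional to link latency; the function names and all the numbers are hypothetical.

```python
def step_time(work_s, n_workers, latency_s, syncs_per_step=1):
    """Toy model: perfectly divided compute plus a per-step sync cost."""
    return work_s / n_workers + syncs_per_step * latency_s


def speedup(work_s, n_workers, latency_s, syncs_per_step=1):
    """Speedup over running the same step on a single worker with no sync."""
    return work_s / step_time(work_s, n_workers, latency_s, syncs_per_step)


if __name__ == "__main__":
    # Assumed values: 1 s of work per step spread over 1000 workers.
    for label, latency in [("tightly coupled", 1e-6), ("over the internet", 0.1)]:
        print(label, round(speedup(1.0, 1000, latency), 1))
```

Under these assumptions the same 1000 workers go from a speedup of roughly 1000x to under 10x when per-step latency rises from a microsecond to 100 ms, even though the algorithm itself is unchanged. Real systems have more going on (bandwidth, stragglers, overlap of compute and communication), so treat this only as a sketch of why "parallel" and "distributable" are different properties.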