by whalesalad 1180 days ago
Are there any training/ownership models like Folding@Home? People could donate idle GPU resources in exchange for access to the data, and perhaps ownership. Then instead of someone needing to pony up $85k to train a model, a thousand people can train a fraction of the model on their consumer GPU and pool the results, reap the collective rewards.
10 comments

A few people have built frameworks to do this.

There is still a very large open problem in how to federate large numbers of loosely coupled computers to speed up training "interesting" models. I've worked in both domains (protein folding via Folding@Home and on supercomputers, and ML training on single nodes and on supercomputers) and, at least so far, ML hasn't really been a good match for embarrassingly parallel compute. Even in protein folding, Folding@Home has a number of limitations that are much better addressed on supercomputers (for example, if your problem requires running extremely long individual simulations of large proteins).

All that could change, but I think for the time being, interesting/big models need to be trained on tightly coupled GPUs.

And you can rule out most of the Monte Carlo stuff too. That rules out parallelizing modern statistical frameworks like Stan that are used for explainable models; things like risk modeling in finance, which samples posteriors using MCMC, also can't be parallelized.
Assuming the chains can reach an equilibrium point (i.e. burn in) quickly, M samples from an MCMC can be parallelized by running N chains in parallel each for M/N iterations. You still end up with M total samples from your target distribution.

You’re only out of luck if each iteration is too compute-intensive to fit on one worker node, even if the work within an iteration is embarrassingly parallelizable, since the overhead of aggregating computations across workers at every iteration would be too high.
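To make the N-chains idea concrete, here is a minimal sketch in Python, not anyone's production sampler. It assumes a toy target (a standard normal) and a plain random-walk Metropolis step; `run_chain`, `sample_with_chains`, and the `map_fn` parameter are names made up for illustration.

```python
import math
import random


def run_chain(job):
    """Run one random-walk Metropolis chain targeting a standard normal."""
    seed, n_samples, burn_in = job
    rng = random.Random(seed)  # independent random stream per chain
    x = 0.0
    samples = []
    for step in range(burn_in + n_samples):
        proposal = x + rng.gauss(0.0, 1.0)
        # Log acceptance ratio for a density proportional to exp(-x^2 / 2).
        log_ratio = 0.5 * (x * x - proposal * proposal)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = proposal
        if step >= burn_in:  # discard the burn-in draws
            samples.append(x)
    return samples


def sample_with_chains(total_samples, n_chains, burn_in=500, map_fn=map):
    """Draw M samples as N chains of M/N iterations each.

    Assumes total_samples is divisible by n_chains. map_fn is the only
    place where the chains interact, so it can be swapped for a parallel map.
    """
    per_chain = total_samples // n_chains
    jobs = [(seed, per_chain, burn_in) for seed in range(n_chains)]
    chains = list(map_fn(run_chain, jobs))
    return [x for chain in chains for x in chain]


samples = sample_with_chains(total_samples=8000, n_chains=4)
print(len(samples))  # 8000
```

To run the chains on separate cores, pass `concurrent.futures.ProcessPoolExecutor().map` as `map_fn` (inside an `if __name__ == "__main__":` guard). Note that every chain pays its own burn-in, so the total work is M + N * burn_in iterations, which is why the quick-burn-in assumption matters.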

In reality the number of chains is not that large though, right? I've seen 3-4 chains in Stan models do the job on most smallish (econ, social sciences) datasets, though I may be wrong about other domains...
Probably going to mirror the transition from single-threaded to multi-threaded compute. It took a while for application architectures that utilize multi-core to take hold among the populace.
Probably not. Multicore has been a thing for 30 years (we had a 32-core Sequent Systems and a 64-core KSR-1 at UW CS&E in the early 1990s). Everything about these models has been developed in a multicore computing context, and thus far it still isn't massively-parallel-distributable. An algorithm can be massively parallel without being sensibly distributable. Changing the latency between compute nodes does not always have a neutral, or even just linear, effect on performance.
Unfortunately training is not an embarrassingly parallelisable [0] problem. It would require a new architecture. Current models diverge too fast. By the time you'd downloaded the weights and/or calculated your contribution, the model would have descended somewhere else and your delta would no longer be applicable, because it was based on the wrong initial state.

It would be great if merge-ability existed. It would also likely apply to efficient/optimal shrinking of models.

Maybe you could dispatch tasks to train on many variations of similar tasks and take the average of the results? It could probably help in some way, but you'd still have a large serialized pipeline to munch through, and you'd likely require some serious hardware on the client side, i.e. dual RTX 4090s.

[0] https://en.wikipedia.org/wiki/Embarrassingly_parallel

hmmm... seems like you're reinventing distributed learning.

merge-ability does exist and you can average the results.

You can if you have the same base weights.

If you have similar variants of the same task you can accelerate it more where the diff is.

You can't average in past results computed from historic base weights - it's a linear process.

If you could do that, you'd just map training examples to diffs and merge them all.

Or take two distinct models and merge them to get a model that is roughly the sum of them. You can't do it, it's not a linear process.

I used words badly there: "it's a linear process" + "it's not a linear process" :)

Let me clarify:

It's a serialised, iterative, step-repeating process where each step depends on the output of the previous one - aka a linear process.

Where each step is a non-linear transformation (gradient descent).

It's not a task that can be distributed (over the internet) because it'd require transferring gigabytes of data (the whole model weights) on each step.

To put it in other words - each distributed task has a massive input size, requires quick computation, and tasks arrive very frequently - which means it can't be distributed over the internet.

Distributed learning sucks for this type of model. Averaging the results helps if you can do it often, which requires very high bandwidth - i.e. the InfiniBand interconnects between Nvidia pods, which go up to 200 Gbps.
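A rough back-of-the-envelope version of that bandwidth argument, as a Python sketch. The 200 Gbps figure is the one quoted above; the model size (3 billion parameters at 2 bytes each, so 6 GB of weights) and the 100 Mbps home link are assumptions picked for illustration, and `seconds_per_sync` is a made-up helper name. It ignores latency and protocol overhead, so real numbers would be worse.

```python
def seconds_per_sync(n_params, bytes_per_param, link_gbps):
    """Seconds to ship one full copy of the weights over a link."""
    bits = n_params * bytes_per_param * 8
    return bits / (link_gbps * 1e9)


# Assumptions (not from the thread): a 3-billion-parameter model stored
# at 2 bytes per parameter (6 GB of weights), and a 100 Mbps home link.
N_PARAMS = 3_000_000_000
BYTES_PER_PARAM = 2

datacenter = seconds_per_sync(N_PARAMS, BYTES_PER_PARAM, link_gbps=200)
home = seconds_per_sync(N_PARAMS, BYTES_PER_PARAM, link_gbps=0.1)

print(f"200 Gbps link: {datacenter:.2f} s per sync")   # 0.24 s
print(f"100 Mbps link: {home / 60:.0f} min per sync")  # 8 min
```

Under these assumptions a single weight exchange costs a fraction of a second inside the datacenter and several minutes over a home connection, and training needs a very large number of such exchanges.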
Yes, there is Petals/BLOOM https://github.com/bigscience-workshop/petals but it's not so great. Maybe it will improve or a better one will come.
I read that it only runs inference (scoring) on the model collaboratively, but it allows some fine-tuning, I guess.

Getting the actual gradient descent to parallelize is more difficult because one needs to average the gradient when using data/batch parallelism. It becomes more a network speed than GPU speed problem. Or are LLMs somehow different?
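Here is a toy sketch of what "averaging the gradient" means under data parallelism, in plain Python. It assumes a one-parameter linear model with mean squared error and equal-sized shards; `grad_mse` and `data_parallel_step` are hypothetical names, and the list comprehension stands in for work that real workers would do on separate machines.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error for the 1-D model y_hat = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def data_parallel_step(w, shards, lr):
    """One synchronous step of data-parallel gradient descent.

    Every worker computes a gradient on its own shard starting from the
    SAME weights w; the gradients are then averaged (the all-reduce).
    """
    local_grads = [grad_mse(w, xs, ys) for xs, ys in shards]
    avg_grad = sum(local_grads) / len(local_grads)
    return w - lr * avg_grad


xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]  # true w is 2
shards = [(xs[:2], ys[:2]), (xs[2:], ys[2:])]         # two equal workers

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards, lr=0.01)
print(round(w, 4))  # 2.0
```

With equal-sized shards the averaged gradient equals the full-batch gradient, so the result matches single-node training. The catch is that the averaging happens on every step, and the gradient has as many entries as the model has parameters, which is where network speed takes over from GPU speed.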

Really interesting live monitor of the network: http://health.petals.ml
I wonder how they handle illegal content. Like, if you're running training data on your computer, what's to stop someone else's illegal data from being uploaded to your computer as part of training?
That’d be cool but I don’t think most idle consumer GPUs (6-8GB) would have large enough memory for a single iteration (batch size 1) of modern LLMs.

But I’d love to see more federated/distributed learning platforms.

6GB can store 3 billion parameters (at 2 bytes each); GPT-3 has 175 billion parameters.
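The arithmetic behind those numbers, as a small Python sketch. It assumes 2 bytes per parameter (which is what 6GB for 3 billion parameters implies) and counts only the weights; gradients, optimizer state, and activations need additional memory during training. The function names are made up for illustration.

```python
def weight_bytes(n_params, bytes_per_param=2):
    """Memory needed just to hold the weights."""
    return n_params * bytes_per_param


def gpus_to_hold_weights(n_params, gpu_bytes, bytes_per_param=2):
    """How many cards of a given size are needed for the weights alone."""
    total = weight_bytes(n_params, bytes_per_param)
    return -(-total // gpu_bytes)  # ceiling division on integers


GB = 10**9
print(weight_bytes(3 * 10**9) / GB)               # 6.0
print(weight_bytes(175 * 10**9) / GB)             # 350.0
print(gpus_to_hold_weights(175 * 10**9, 6 * GB))  # 59
```

So the 175-billion-parameter figure works out to roughly 350 GB of weights, or about 59 cards with 6GB each before any training state is counted.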
Is it possible to break the model apart? Or does the entire thing need to be architected from the get-go such that an individual GPU can own a portion end to end?
It's possible to break the model apart (I mean, for the larger models it's not just that an 8GB card isn't enough, even a single 80GB card isn't enough), but that needs a high-speed interconnect (Nvidia pods provide hundreds of Gbps, and use all of it) because you need to exchange those parameters quite often, so you're limited by the interconnect speed just as much as by your compute.
The main reason an arbitrarily distributed set of compute nodes cannot give you good performance for training a model (even if you have an immodest number of nodes) is that the latency of the inter-node communication will be a massive bottleneck. GPU cloud providers shell out big bucks for ultra-fast intra-DC networking via InfiniBand and the like, and the networking gets as much attention as (if not sometimes more than) the capabilities of the nodes themselves.
How long until somebody creates a crypto project on that?
Bittensor is one, not an endorsement. chat.bittensor.com
Every parameter needs to reach every other parameter. Ideally there is enough core memory for that. But there are tiling algorithms.
This is how you get skynet.