by mirekrusin 1180 days ago
Unfortunately, training is not an embarrassingly parallel [0] problem. It would require a new architecture. Current models diverge too fast. By the time you'd downloaded and/or calculated your contribution, the model would have descended somewhere else and your delta would no longer be applicable, because it was based on the wrong initial state.

It would be great if merge-ability existed. It would also likely apply to efficient/optimal shrinking of models.

Maybe you could dispatch jobs to train on many variations of similar tasks and take the average of the results? It could probably help in some way, but you'd still have a large serialized pipeline to munch through, and you'd likely need some serious hardware on the client side, e.g. dual RTX 4090s.

[0] https://en.wikipedia.org/wiki/Embarrassingly_parallel

1 comments

hmmm... seems like you're reinventing distributed learning.

merge-ability does exist and you can average the results.

You can if you have the same base weights.

If you have similar variants of the same task you can accelerate it more where the diff is.

You can't average on past results computed from historic base weights - it's linear process.
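
To make the "same base weights" point concrete, here is a minimal NumPy sketch. It assumes a toy linear-regression loss; the data, the shard split and the learning rate are all made up for illustration. Averaging gradients from two equal-sized shards computed at the same weights reproduces the full-batch gradient, while a gradient computed at the old weights no longer matches once the model has taken a step.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # toy inputs (hypothetical data)
y = rng.normal(size=8)        # toy targets (hypothetical data)

def grad(w, X, y):
    """Gradient of 0.5 * mean((X @ w - y) ** 2) with respect to w."""
    return X.T @ (X @ w - y) / len(y)

w0 = np.zeros(3)              # shared base weights

# Two workers, equal-sized shards, both starting from the same w0.
g_a = grad(w0, X[:4], y[:4])
g_b = grad(w0, X[4:], y[4:])
g_avg = (g_a + g_b) / 2
g_full = grad(w0, X, y)
same_base_matches = np.allclose(g_avg, g_full)

# The model moves on while worker B is still holding its old result.
lr = 0.1                      # assumed learning rate
w1 = w0 - lr * g_full
g_b_stale = g_b                          # computed at w0
g_b_fresh = grad(w1, X[4:], y[4:])       # what B should have sent for w1
stale_matches = np.allclose(g_b_stale, g_b_fresh)

print("average equals full-batch gradient:", same_base_matches)
print("stale gradient equals fresh gradient:", stale_matches)
```

Even in this linear toy case the stale gradient is off; with a deep non-linear model the mismatch is generally harder to bound.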

If you could do that, you'd just map training examples to diffs and merge them all.

Or take two distinct models and merge them to get a model that is roughly the sum of them. You can't do it, it's not linear process.
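
One well-known reason naive weight averaging breaks down for distinct models can be sketched with a hypothetical two-hidden-unit tanh network (every weight below is made up). Swapping the hidden units gives different weights but exactly the same function, yet the average of the two weight sets computes something else. This is only an illustration, not necessarily the exact argument meant above.

```python
import numpy as np

def mlp(x, W1, w2):
    """Tiny network: one input, tanh hidden layer, linear output."""
    return np.tanh(x @ W1) @ w2

# Model A (hypothetical weights): 1 input -> 2 hidden units -> 1 output.
W1_a = np.array([[1.0, -2.0]])
w2_a = np.array([3.0, 0.5])

# Model B: model A with its two hidden units swapped.
W1_b = W1_a[:, ::-1]
w2_b = w2_a[::-1]

x = np.linspace(-1.0, 1.0, 5).reshape(-1, 1)
out_a = mlp(x, W1_a, w2_a)
out_b = mlp(x, W1_b, w2_b)
same_function = np.allclose(out_a, out_b)

# Naive merge: element-wise average of the two weight sets.
W1_m = (W1_a + W1_b) / 2
w2_m = (w2_a + w2_b) / 2
out_m = mlp(x, W1_m, w2_m)
merged_matches = np.allclose(out_m, out_a)

print("A and B compute the same function:", same_function)
print("averaged weights compute that function too:", merged_matches)
```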

I worded that badly: "it's linear process" + "it's not linear process" :)

Let me clarify:

It's a serialised, iterative, step-repeating process where each step depends on the output of the previous one - aka a linear process.

Where each step is a non-linear transformation (gradient descent).

It's not a task you can distribute (over the internet) because it'd require transferring gigabytes of data (the whole model weights) on each step.

To put it in other words - the distributed task has a massive input size, requires quick computation, and tasks arrive very frequently - which means it can't be distributed over the internet.

Distributed learning sucks for this type of model. Averaging the results helps if you can do it often, which requires very high bandwidth, e.g. the InfiniBand interconnects between Nvidia pods, which go up to 200 Gbps.
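
Back-of-the-envelope arithmetic for the bandwidth point, as a rough sketch. The 7B-parameter fp16 model and the 100 Mbps home connection are assumptions picked for illustration; the 200 Gbps figure is the one quoted above. Latency and protocol overhead are ignored.

```python
def transfer_seconds(size_gigabytes: float, link_gbps: float) -> float:
    """Seconds to move size_gigabytes over a link rated at link_gbps gigabits/s."""
    return size_gigabytes * 8 / link_gbps

params = 7e9            # hypothetical model size (parameters)
bytes_per_param = 2     # assuming fp16 weights
size_gb = params * bytes_per_param / 1e9   # 14 GB of weights

home_s = transfer_seconds(size_gb, 0.1)    # assumed 100 Mbps home link: about 1120 s
pod_s = transfer_seconds(size_gb, 200)     # 200 Gbps interconnect: about 0.56 s

print(f"weights: {size_gb:.0f} GB")
print(f"home link: {home_s / 60:.1f} min per full sync")
print(f"200 Gbps link: {pod_s:.2f} s per full sync")
```

Under these assumptions a single full sync over a home connection takes minutes, versus well under a second on the datacenter link, which is the gap the comment is pointing at.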