I did some bad use of words there "it's linear process" + "it's not linear process" :)
Let me clarify:
It's serialised, iterative, step repeating process where each step depends on output of previous one - aka linear process.
Where each step is non-linear transformation (gradient descent).
It's not distributable (over internet) task because it'd require transferring gigabytes of data (whole model weights) on each step.
To put it in other words - distributed task has massive input size and requires quick computation and tasks arrive very frequently - which means it can't be distributed over internet.
Distributed learning sucks for this type of models, averaging the results helps if you can do that often which requires very high bandwidth - i.e. the Infiniband interconnects between Nvidia pods which go up to 200 Gbps.
If you have similar variants of the same task you can accelerate it more where the diff is.
You can't average on past results computed from historic base weights - it's linear process.
If you could do that, you'd just map training examples to diffs and merge them all.
Or take two distinct models and merge them to have model that is roughly sum of them. You can't do it, it's not linear process.