

	by alchemist1e9 1163 days ago
	Can they mathematically be “mushed” together to create an improved model? I have yet to understand the difference between fine-tuning and training, and therefore whether a distributed, decentralized, eventually consistent training approach is a possibility or simply not realistic.


tlb 1163 days ago

If you make N copies of a model, train them independently for a little while on N machines, and average them back together, it sort of works. But not if you train for very long, as the internal structure diverges.

It becomes an empirical engineering question how many parallel nodes you can train on for how long before averaging them back together. It's an expensive question to answer, since you have to train many variations to get the data.
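
For concreteness, here is a minimal sketch of the copy, train, and average loop described above. Everything in it is an assumption for illustration: the function names, the toy model y = w*x + b, the shard sizes, and the hyperparameters are all made up. The toy problem is also convex, so averaging works far better here than it would for a deep network, where the replicas' internal structure can diverge.

```python
import random

def make_shard(rng, n=64, true_w=2.0, true_b=-1.0):
    # Hypothetical private data shard for one worker: y = true_w * x + true_b + noise.
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, true_w * x + true_b + rng.gauss(0.0, 0.01)) for x in xs]

def local_train(w, b, shard, steps, lr):
    # Plain gradient descent on mean squared error, using only this worker's shard.
    for _ in range(steps):
        gw = sum(2.0 * (w * x + b - y) * x for x, y in shard) / len(shard)
        gb = sum(2.0 * (w * x + b - y) for x, y in shard) / len(shard)
        w, b = w - lr * gw, b - lr * gb
    return w, b

def train_and_average(n_workers=4, rounds=20, local_steps=10, lr=0.2, seed=0):
    rng = random.Random(seed)
    shards = [make_shard(rng) for _ in range(n_workers)]
    w, b = 0.0, 0.0
    for _ in range(rounds):
        # Every worker starts from the same shared parameters and trains alone...
        replicas = [local_train(w, b, s, local_steps, lr) for s in shards]
        # ...then the replicas are averaged back into one model.
        w = sum(r[0] for r in replicas) / n_workers
        b = sum(r[1] for r in replicas) / n_workers
    return w, b
```

The knobs that matter for the engineering question are `n_workers` and `local_steps`: how many replicas, and how long each runs before the next average.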


alchemist1e9 1163 days ago

I was wondering whether you can fine-tune / train on a restricted subspace of the weights. If so, then one can assign specific partitioned subspaces to each worker, and the averaging wouldn’t overlap. However, maybe that would destroy some valuable cohesion.
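
A rough sketch of what that partitioning could look like, with the weights as a flat list. The helper names `partition` and `merge_partitioned` are hypothetical, not from any library. Each worker computes a gradient over all the weights but is only allowed to change its own block, so merging is a disjoint copy rather than an average.

```python
def partition(n_params, n_workers):
    # Split parameter indices 0..n_params-1 into contiguous, non-overlapping blocks.
    base, extra = divmod(n_params, n_workers)
    blocks, start = [], 0
    for k in range(n_workers):
        size = base + (1 if k < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks

def merge_partitioned(weights, worker_grads, lr=0.1):
    # worker_grads[k] is worker k's gradient over ALL weights,
    # but only the entries inside worker k's block are applied.
    blocks = partition(len(weights), len(worker_grads))
    merged = list(weights)
    for block, grad in zip(blocks, worker_grads):
        for i in block:
            merged[i] -= lr * grad[i]
    return merged
```

The cohesion worry shows up in the fact that each worker's gradient was computed while treating every other block as frozen, so the merged update may not be a descent direction for the full model.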


tlb 1162 days ago

I haven't heard of that being tried (though I don't read everything). Someone could do the experiment and write it up, and maybe get it published. The main ML conferences rarely publish anything that's not an improvement on the SOTA, which is why it's so hard to find anything about ideas that don't quite work.


alchemist1e9 1162 days ago

The underlying motivation behind my thoughts and comments is investigating whether a decentralized but periodically coordinated algorithm for training LLMs exists. We have millions of GPUs distributed across the world. If they could somehow be put to work on training without extreme requirements on data transfer between them, that could enable training of large LLMs in an open-source way, even if that training is technically energy-suboptimal.


whimsicalism 1162 days ago

Yeah, your intuition that this would destroy cohesion is correct.

It's basically not possible to do what you are trying to do in an async manner. With advancements in large batch gradients, it might be possible to do some sort of synchronous P2P gradient averaging.
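
To illustrate why the synchronous case is the easy one: when every worker holds an equal-sized shard and evaluates the gradient at the same weights, the mean of the per-worker gradients is exactly the full-batch gradient, so the workers jointly take one large-batch step. A toy sketch, with a made-up one-parameter model y = w*x and hypothetical function names:

```python
def mse_grad(w, batch):
    # Gradient of mean squared error for the toy model y = w * x.
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def all_reduce_mean(values):
    # Stand-in for a network all-reduce: every peer ends up with the mean.
    return sum(values) / len(values)

def synchronous_step(w, shards, lr=0.1):
    # Each peer computes a gradient on its own shard at the SAME weights w,
    # the gradients are averaged, and every peer applies the same update.
    g = all_reduce_mean([mse_grad(w, shard) for shard in shards])
    return w - lr * g
```

The cost is that peers exchange a full gradient every step, which is the bandwidth requirement the decentralized setting is trying to avoid.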


alchemist1e9 1162 days ago

Edit: I’m reading this to try and get some sense of the issues - https://www.amazon.science/blog/near-linear-scaling-of-gigan...

What about with some fairly frequent and periodic synchronization?

Is there potentially some balance where small enough subsets can be chosen, and disparate workers broadcast their small changes at short enough intervals, that the net gain in learning is still larger than the loss in fit due to de-cohesion? I was thinking maybe this algorithm would be 10x less energy efficient but have the benefit of decentralization. Something along those lines.

I’m guessing the current training algorithms do something like this, but since more rapid synchronization always increases efficiency (in the extreme, that giant single-wafer CPU), OpenAI and others use systems with high interconnect bandwidth.


whimsicalism 1162 days ago

I am not familiar with that work.

> where small enough subsets can be chosen and disparate workers broadcast the small changes at small enough intervals that the net gain in learnings is still larger than the loss in fit due to de-cohesion

I think this probably depends on the terrain of your loss landscape. My intuition is that many are too spiky: if you take a step or two in each of your subsets and then average them, you will end up on a steep hill rather than in a valley between your two points.

But this is an active area of research for sure.
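
A deliberately tiny illustration of that failure mode, assuming a made-up 1D double-well loss (x^2 - 1)^2 with valleys at x = -1 and x = +1. It is a caricature, not a claim about real LLM loss surfaces: two workers start on opposite sides of the hump at x = 0, each descends into its own valley, and their average lands back on the hump.

```python
def loss(x):
    # Hypothetical double-well loss: minima at x = -1 and x = +1, a hump at x = 0.
    return (x * x - 1.0) ** 2

def grad(x):
    # Derivative of (x^2 - 1)^2 with respect to x.
    return 4.0 * x * (x * x - 1.0)

def descend(x, steps=200, lr=0.05):
    # Plain gradient descent from starting point x.
    for _ in range(steps):
        x -= lr * grad(x)
    return x

def average_two_workers(start_a=-0.1, start_b=0.1):
    a = descend(start_a)   # this worker settles in the left valley
    b = descend(start_b)   # this worker settles in the right valley
    avg = 0.5 * (a + b)    # parameter average of the two replicas
    return a, b, avg
```

Comparing `loss(avg)` against `loss(a)` and `loss(b)` shows the averaged model is worse than either replica here, which is the opposite of what happens when both replicas stay in the same basin.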
