Hacker News
by bilsbie 811 days ago
I haven’t been able to make sense of model merging. Any insights?

Wouldn’t weights between models be completely different? And then there are architecture differences on top of that.

1 comments

Model merging is usually done with different fine-tunes of the same model. It doesn’t work if the base models are different.
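To make the "same base model" requirement concrete, here is a minimal sketch of the simplest kind of merge: linear interpolation of two fine-tunes, parameter by parameter. The function name `merge_state_dicts`, the `alpha` parameter, and the toy checkpoints are all made up for illustration, and plain lists stand in for tensors. Real merging tools are more involved, but the point is the same: the merge only makes sense when both checkpoints have the same parameter names and shapes, which is what you get from fine-tuning the same base.

```python
def merge_state_dicts(a, b, alpha=0.5):
    """Linearly interpolate two checkpoints: (1 - alpha) * a + alpha * b.

    Hypothetical helper for illustration. Checkpoints are dicts mapping
    parameter names to flat lists of floats (stand-ins for tensors).
    """
    if a.keys() != b.keys():
        # Different architectures or bases: nothing lines up to average.
        raise ValueError("checkpoints have different parameter names")
    merged = {}
    for name in a:
        if len(a[name]) != len(b[name]):
            raise ValueError(f"shape mismatch for {name}")
        merged[name] = [
            (1 - alpha) * x + alpha * y for x, y in zip(a[name], b[name])
        ]
    return merged


# Two toy "fine-tunes" of the same base: identical names and shapes.
finetune_a = {"layer1.weight": [0.0, 2.0], "layer1.bias": [1.0]}
finetune_b = {"layer1.weight": [2.0, 4.0], "layer1.bias": [3.0]}

merged = merge_state_dicts(finetune_a, finetune_b)
print(merged)
```

With a different base model the names might even match, but the weights would not correspond to each other in any meaningful way, so averaging them would just produce noise.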

One of the more surprising things is that you can actually repeat layers to improve model performance, i.e. 1-1-2-2 instead of 1-2. That’s how you get models with higher parameter counts than the original.
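A rough sketch of the 1-1-2-2 idea, with strings standing in for layers. The helper name `repeat_layers` is hypothetical. In a real model each repeated entry would be a copy of that layer's weights, which is why the parameter count grows.

```python
def repeat_layers(layers, times=2):
    """Return a new stack where each layer appears `times` times in a row.

    Hypothetical helper: `layers` is any ordered list of layer objects.
    """
    return [layer for layer in layers for _ in range(times)]


original = ["layer1", "layer2"]      # the 1-2 stack
expanded = repeat_layers(original)   # the 1-1-2-2 stack
print(expanded)
print(len(expanded) / len(original))  # growth factor in layer count
```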

Cf. also the Universal Transformer: the same layer stacked many times. The sparse version of that is basically MoE plus a stick-breaking mechanism that prevents vanishing gradients while letting the model decide, per token, whether to stop early and use fewer layers (of course with a training reward that favors fewer layers, to represent the compute savings).
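For anyone unfamiliar with stick-breaking, here is a minimal sketch of the general construction, not the exact formulation from any particular paper. Assume each layer emits a halting probability for a token; the probability of stopping exactly at layer l is that layer's halting probability times the probability of not having stopped at any earlier layer. The function names, the example probabilities, and the choice to assign leftover mass to the last layer are all assumptions for illustration.

```python
def stick_breaking(halt_probs):
    """Turn per-layer halting probabilities into a distribution over depth.

    Hypothetical helper. Entry l of the result is the probability of
    halting exactly at layer l. Any leftover mass is assigned to the
    final layer (an assumption), so the result sums to 1.
    """
    remaining = 1.0  # probability of not having halted yet
    dist = []
    for p in halt_probs:
        dist.append(remaining * p)
        remaining *= 1.0 - p
    dist[-1] += remaining  # forced stop at the last layer
    return dist


def expected_depth(dist):
    """Expected number of layers used; a quantity a compute penalty could target."""
    return sum((i + 1) * p for i, p in enumerate(dist))


dist = stick_breaking([0.5, 0.5, 0.5])
print(dist)
print(expected_depth(dist))
```

A training penalty proportional to something like `expected_depth` is one way to express the "favor fewer layers" reward mentioned above.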