Hacker News
by valine 811 days ago
The architecture changes are very straightforward. Model merging has shown that pre-trained transformer layers are very robust. I’ll bet it’s possible to fine-tune a pre-trained model like Mistral to use this architecture. That would enable someone to test it with more parameters without training a whole new base model.
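To make the retrofit idea concrete, here is a minimal sketch. It assumes "this architecture" means adding a depth-weighted average (DWA) after every block, mixing the input embedding and all earlier block outputs. The function names are hypothetical, vectors are plain lists, and the blocks are stand-in callables. If the DWA weights start at identity (all weight on the newest block output), the network computes exactly what the plain stack computes, which is why starting from a pre-trained model seems plausible at all.

```python
# Hypothetical sketch, not the paper's code: a depth-weighted average (DWA)
# applied after each block. Assumption: the DWA mixes the input embedding
# and the raw outputs of every block so far.

def identity_dwa_weights(num_blocks):
    """Row i weights [x_0, ..., x_(i+1)]; identity init puts 1.0 on the newest output."""
    return [[0.0] * (i + 1) + [1.0] for i in range(num_blocks)]

def weighted_sum(weights, vectors):
    """Element-wise weighted sum of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]

def forward_with_dwa(x, blocks, dwa_weights):
    """Run the blocks in order, re-mixing the history after each one."""
    history = [x]      # x_0: the input embedding
    current = x
    for block, weights in zip(blocks, dwa_weights):
        history.append(block(current))            # raw block output
        current = weighted_sum(weights, history)  # input to the next block
    return current

# Stand-in "blocks" for illustration only.
blocks = [lambda v: [a + 1.0 for a in v], lambda v: [a * 2.0 for a in v]]
plain = blocks[1](blocks[0]([1.0, 2.0]))
retrofit = forward_with_dwa([1.0, 2.0], blocks, identity_dwa_weights(2))
```

With identity weights `retrofit` equals `plain`, so any gain has to come from training the weights away from identity afterwards.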
2 comments

They try this in the appendix without success, unfortunately. It seems having this enabled early on in training is important.
We're still working on training the DWA weights on top of a pretrained model. We're hopeful that this is feasible. The experiments you're mentioning in the appendix do not change the learning rate scheduler. E.g., when we start training the DWA weights after 20k iterations, the learning rate is already quite small. To some extent, this might explain the diminishing returns. Maybe this could work with a completely different learning rate scheduler.
Yeah, you can't change the model much with low LRs. That's the point! Same reason you don't get continual learning if you just keep using low LRs: https://arxiv.org/abs/2403.08763 You need to really shake up the model if you want to learn some genuinely better (i.e. different) internal representations that exploit the DenseNet (https://arxiv.org/abs/1608.06993)/LTG-BERT (https://arxiv.org/abs/2311.02265) arch you're using here.
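A rough sketch of what a different scheduler could look like: keep the usual warmup-plus-cosine curve, but restart it (re-warm to the peak) at the step where the new weights are switched on. All the numbers below (peak LR, warmup length, total steps) are made up for illustration and are not taken from the paper; only the 20k switch-on step comes from the comment above.

```python
import math

def cosine_lr(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr at total_steps."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def rewarmed_lr(step, restart_step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Same curve before restart_step; afterwards a fresh warmup and decay."""
    if step < restart_step:
        return cosine_lr(step, peak_lr, warmup_steps, total_steps, min_lr)
    return cosine_lr(step - restart_step, peak_lr, warmup_steps,
                     total_steps - restart_step, min_lr)

# Hypothetical settings: 25k total steps, new weights enabled at step 20k.
PEAK, WARMUP, TOTAL, RESTART = 1e-3, 1000, 25000, 20000
lr_without_restart = cosine_lr(21000, PEAK, WARMUP, TOTAL)
lr_with_restart = rewarmed_lr(21000, RESTART, PEAK, WARMUP, TOTAL)
```

Under these assumed settings the un-restarted schedule is already far down the cosine curve by step 21k, while the restarted one is back at the peak, which is the "shake up the model" effect being argued for.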
I haven’t been able to make sense of model merging. Any insights?

Wouldn’t weights between models be completely different? And then there are architecture differences on top of that.

Model merging is usually done with different fine-tunes of the same model. It doesn’t work if the base models are different.
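For intuition, the simplest form of merging is just a weighted average of parameters, tensor by tensor. A minimal sketch, using plain dicts of float lists as stand-ins for checkpoints (the function name and the toy values are hypothetical). The key and shape checks are the reason the base model has to be shared: fine-tunes of one base have the same parameter names and shapes, and their values sit close to each other.

```python
def merge_finetunes(state_dicts, weights=None):
    """Weighted average of checkpoints that share parameter names and shapes."""
    if not state_dicts:
        raise ValueError("need at least one checkpoint")
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    keys = state_dicts[0].keys()
    for sd in state_dicts[1:]:
        if sd.keys() != keys:
            raise ValueError("checkpoints do not share parameter names")
    merged = {}
    for name in keys:
        tensors = [sd[name] for sd in state_dicts]
        if len({len(t) for t in tensors}) != 1:
            raise ValueError(f"shape mismatch for {name}")
        merged[name] = [
            sum(w * t[i] for w, t in zip(weights, tensors))
            for i in range(len(tensors[0]))
        ]
    return merged

# Two toy "fine-tunes" of the same base (values invented for illustration).
finetune_a = {"layer1.weight": [1.0, 3.0]}
finetune_b = {"layer1.weight": [3.0, 5.0]}
merged = merge_finetunes([finetune_a, finetune_b])
```

Real merging methods are more elaborate than a plain average, but they rest on the same one-to-one correspondence between parameters.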

One of the more surprising things is that you can actually repeat layers to improve model performance, i.e. 1-1-2-2 instead of 1-2. That’s how you get models with higher parameter counts than the original.
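The 1-1-2-2 pattern is easy to write down. A small sketch with a hypothetical helper; the layers here are dicts standing in for real modules, and each repeat is deep-copied so the copies count as separate parameters and could be trained independently afterwards.

```python
import copy

def repeat_layers(layers, repeats=2):
    """Return a deeper stack where every layer appears `repeats` times in a row."""
    return [copy.deepcopy(layer) for layer in layers for _ in range(repeats)]

# Toy two-layer "model"; the ids make the resulting order visible.
base = [{"id": 1, "params": [0.1, 0.2]}, {"id": 2, "params": [0.3, 0.4]}]
deeper = repeat_layers(base)
order = [layer["id"] for layer in deeper]
param_count = sum(len(layer["params"]) for layer in deeper)
```

Here `order` is the 1-1-2-2 pattern and the parameter count doubles relative to `base`.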

Cf. also the Universal Transformer: the same layer stacked many times. The sparse version of that is basically MoE plus a stick-breaking mechanism that prevents vanishing gradients while letting the model decide whether to stop early at a given token (ofc with training rewards that favor fewer layers, to represent the compute savings).
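For anyone unfamiliar with stick-breaking: each layer takes a fraction of whatever probability mass is left, so the per-layer halting weights always sum to 1. A generic sketch of that construction follows; the function names are hypothetical, and whether any particular sparse Universal Transformer implementation does it exactly this way is an assumption.

```python
def stick_breaking_weights(halt_probs):
    """Turn per-layer halting probabilities into a distribution over exit depths.

    Layer l gets halt_probs[l] times the mass not yet claimed by earlier
    layers; the final layer takes whatever mass remains.
    """
    if not halt_probs:
        raise ValueError("need at least one layer")
    weights = []
    remaining = 1.0
    for p in halt_probs[:-1]:
        weights.append(remaining * p)
        remaining *= 1.0 - p
    weights.append(remaining)
    return weights

def expected_depth(weights):
    """Expected number of layers used; usable as a compute penalty during training."""
    return sum(depth * w for depth, w in enumerate(weights, start=1))

weights = stick_breaking_weights([0.5, 0.5, 0.5])
depth = expected_depth(weights)
```

Adding a multiple of `expected_depth` to the loss is one way to express the preference for fewer layers mentioned above.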