by itkovian_
5 days ago
This is called linear mode connectivity, and it seems to work for almost every large model. It works so well that in most cases it’s an explicit part of the training process: run many training ‘branches’, merge them, then continue training. It is not understood why it works so well.
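For anyone wondering what "merge" could mean mechanically, here is a minimal sketch, assuming the merge is a plain element-wise average of the branch weights (the comment above does not say which merge rule is used in practice). The `Params` format and the names `interpolate`, `merge_branches`, `branch_a` and `branch_b` are hypothetical, made up for illustration. Linear mode connectivity is the observation that the loss stays low along the straight line traced by `interpolate`; checking that would also need a loss evaluation at each point, which is not shown here.

```python
from typing import Dict, List

# Hypothetical checkpoint format: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def interpolate(a: Params, b: Params, alpha: float) -> Params:
    """Return (1 - alpha) * a + alpha * b, parameter by parameter."""
    if a.keys() != b.keys():
        raise ValueError("checkpoints must share the same parameter names")
    return {
        name: [(1.0 - alpha) * x + alpha * y for x, y in zip(a[name], b[name])]
        for name in a
    }


def merge_branches(branches: List[Params]) -> Params:
    """Uniformly average several checkpoints of the same architecture."""
    if not branches:
        raise ValueError("need at least one branch")
    n = len(branches)
    return {
        name: [sum(vals) / n for vals in zip(*(b[name] for b in branches))]
        for name in branches[0]
    }


# Two toy "branches" that started from the same weights and then diverged.
branch_a = {"w": [1.0, 2.0], "b": [0.0]}
branch_b = {"w": [3.0, 6.0], "b": [1.0]}

# Midpoint of the straight line between the two checkpoints.
midpoint = interpolate(branch_a, branch_b, 0.5)

# Assumed merge step: average the branches, then training would continue
# from the merged weights.
merged = merge_branches([branch_a, branch_b])
```

With two branches, the uniform average is the same point as the `alpha = 0.5` interpolation, so `merged` and `midpoint` coincide here.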