by cpldcpu
735 days ago
It's just continued pretraining to "heal" the damage caused by switching the activation functions and enforcing sparsity. Apparently they managed to recover the original performance on standardized tests after continuing pretraining on the 150B tokens. Some more specialized knowledge that their dataset didn't cover may still have been lost.
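
To make the "damage" concrete, here is a rough sketch. The comment doesn't say which activations were involved, so the SiLU-to-ReLU swap below is purely an assumption for illustration, as are the Gaussian pre-activations and the helper names. It only shows that the swap yields exact zeros (sparsity) while shifting the values the downstream weights were trained on, which is the gap continued pretraining would have to close.

```python
import math
import random


def silu(x):
    # Smooth activation: x * sigmoid(x). Almost never exactly zero.
    return x / (1.0 + math.exp(-x))


def relu(x):
    # Hard threshold: every negative input becomes exactly zero.
    return max(0.0, x)


def sparsity(values):
    # Fraction of activations that are exactly zero.
    return sum(1 for v in values if v == 0.0) / len(values)


# Hypothetical pre-activations, drawn from a standard normal (an assumption).
random.seed(0)
pre_acts = [random.gauss(0.0, 1.0) for _ in range(10_000)]

silu_out = [silu(x) for x in pre_acts]
relu_out = [relu(x) for x in pre_acts]

# Average change the following layers see if the activation is swapped
# without any retraining.
mean_shift = sum(abs(a - b) for a, b in zip(silu_out, relu_out)) / len(pre_acts)

print(f"SiLU sparsity: {sparsity(silu_out):.2f}")
print(f"ReLU sparsity: {sparsity(relu_out):.2f}")
print(f"mean |SiLU - ReLU|: {mean_shift:.3f}")
```

The nonzero shift is why the swapped model needs further training at all; how many tokens that takes is an empirical question this toy example can't answer.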