by cpldcpu 735 days ago
It's just continued pretraining to "heal" the damage caused by switching the activation functions and enforcing sparsity.

Apparently they recovered the original performance on standardized tests after continuing pretraining on 150B tokens. Some more specialized knowledge that their dataset didn't cover may still have been lost.
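
To make the "damage" part concrete, here is a toy sketch of what swapping the activation function does to a layer's outputs. The SiLU-to-ReLU swap, the Gaussian pre-activations, and the helper names are all my assumptions for illustration; none of them come from the paper or the thread.

```python
import math
import random


def silu(x: float) -> float:
    """Smooth activation (assumed stand-in for the original one)."""
    return x / (1.0 + math.exp(-x))


def relu(x: float) -> float:
    """Hard-thresholding activation (assumed stand-in for the sparse one)."""
    return max(0.0, x)


def sparsity(values: list[float]) -> float:
    """Fraction of outputs that are exactly zero."""
    return sum(1 for v in values if v == 0.0) / len(values)


random.seed(0)
# Hypothetical pre-activations of one hidden layer, drawn from N(0, 1).
pre_activations = [random.gauss(0.0, 1.0) for _ in range(10_000)]

silu_out = [silu(x) for x in pre_activations]
relu_out = [relu(x) for x in pre_activations]

silu_sparsity = sparsity(silu_out)
relu_sparsity = sparsity(relu_out)

# Average shift in the layer's output caused by the swap alone,
# before any further training has had a chance to compensate.
mean_abs_change = sum(
    abs(a - b) for a, b in zip(silu_out, relu_out)
) / len(pre_activations)

print(f"SiLU sparsity: {silu_sparsity:.3f}")
print(f"ReLU sparsity: {relu_sparsity:.3f}")
print(f"Mean |SiLU - ReLU|: {mean_abs_change:.3f}")
```

The swap zeroes out a large share of the outputs, which is the sparsity you want, but it also shifts every downstream layer's inputs away from what the weights were trained on. Continued pretraining is what lets the weights adapt to that shift.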