Hacker News
by russianGuy83829 734 days ago
It seems like this can’t run all models and needs custom ones trained from scratch: “We introduce two new models: TurboSparse-Mistral-7B and TurboSparse-Mixtral-47B. These models are sparsified versions of Mistral and Mixtral […]. Notably, our models are trained with just 150B tokens within just 0.1M dollars”.

It remains to be seen how good these custom models are.

3 comments

Paper for the sparsified Mixtral models:

https://arxiv.org/abs/2406.05955

It's just continued pretraining to "heal" the damage caused by switching the activation functions and enforcing sparsity.

Apparently they managed to recover original performance on standardized tests after continuing pretraining with the 150B tokens. There may be some more specialized knowledge lost that was not covered by their dataset.
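For anyone wondering why switching the activation function matters for sparsity, here is a minimal sketch. It assumes a plain ReLU replacing a smooth activation like SiLU and uses random Gaussian inputs as stand-ins for real pre-activations; the paper's actual activation function and training setup may differ, and all names below are made up for illustration.

```python
import math
import random


def silu(x: float) -> float:
    # Smooth activation: negative inputs give small but nonzero outputs.
    return x / (1.0 + math.exp(-x))


def relu(x: float) -> float:
    # Hard threshold: every negative input becomes exactly zero.
    return max(0.0, x)


def sparsity(values) -> float:
    # Fraction of activations that are exactly zero.
    return sum(1 for v in values if v == 0.0) / len(values)


# Hypothetical pre-activation values; a real model's distribution will differ.
random.seed(0)
pre_activations = [random.gauss(0.0, 1.0) for _ in range(10_000)]

silu_out = [silu(x) for x in pre_activations]
relu_out = [relu(x) for x in pre_activations]

print(f"SiLU sparsity: {sparsity(silu_out):.2f}")
print(f"ReLU sparsity: {sparsity(relu_out):.2f}")
```

With zero-centered inputs, roughly half of the ReLU outputs are exact zeros that an inference engine could skip, while SiLU produces essentially none. The swap changes what every layer outputs, which is the "damage" the continued pretraining has to heal.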

Agreed. Custom models could be hit or miss.