


	by russianGuy83829 734 days ago
	It seems like this can’t run all models, and needs custom ones trained from scratch: “We introduce two new models: TurboSparse-Mistral-7B and TurboSparse-Mixtral-47B. These models are sparsified versions of Mistral and Mixtral […]. Notably, our models are trained with just 150B tokens within just 0.1M dollars”. It remains to be seen how good these custom models are.

3 comments

TOMDM 734 days ago

Paper for the sparsified Mixtral models:

https://arxiv.org/abs/2406.05955


cpldcpu 734 days ago

It's just continued pretraining to "heal" the damage caused by switching the activation functions and enforcing sparsity.

Apparently they managed to recover original performance on standardized tests after continuing pretraining with the 150B tokens. There may be some more specialized knowledge lost that was not covered by their dataset.
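
To make the "switching the activation functions and enforcing sparsity" point concrete, here is a toy sketch, not taken from the paper. It assumes a smooth SiLU-style activation is swapped for plain ReLU and that pre-activations are roughly Gaussian; the paper's actual activation and distributions may differ. It only shows why the swap produces exact zeros that an inference engine could skip.

```python
import math
import random

def silu(x):
    # Smooth activation: negative inputs give small but nonzero outputs.
    return x / (1.0 + math.exp(-x))

def relu(x):
    # Hard threshold: every negative input becomes exactly zero.
    return max(0.0, x)

def zero_fraction(values):
    # Share of activations that are exactly zero (skippable neurons).
    return sum(1 for v in values if v == 0.0) / len(values)

# Assumption: standard-normal pre-activations, purely for illustration.
random.seed(0)
pre_activations = [random.gauss(0.0, 1.0) for _ in range(10_000)]

silu_sparsity = zero_fraction([silu(x) for x in pre_activations])
relu_sparsity = zero_fraction([relu(x) for x in pre_activations])

print(f"SiLU zero fraction: {silu_sparsity:.3f}")
print(f"ReLU zero fraction: {relu_sparsity:.3f}")
```

Swapping the activation changes the function the network computes, which is presumably the "damage" that the continued pretraining has to heal.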


helloericsf 734 days ago

Agreed. Custom models could be hit or miss.
