| HN Mirror



	by sp332 875 days ago
	The extra precision is mostly useful for training. Once the network is trained, it's a statistical model and only needs enough precision to make good guesses. In fact, one of the big papers on this also pointed out that you can drop about 40% of the weights completely. I think people generally skip that part because sparse matrix operations are slower than dense ones, so dropping weights doesn’t help with speed.
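To make the two ideas concrete, here is a minimal sketch in plain Python, not taken from any particular paper or library. It assumes "dropping weights" means zeroing the smallest-magnitude 40% and "less precision" means symmetric 8-bit quantization; the function names `prune_by_magnitude`, `quantize_int8`, and `dequantize` are made up for illustration, and the weights are random numbers rather than a real model.

```python
import random

def prune_by_magnitude(weights, fraction):
    """Zero out the `fraction` of weights with the smallest absolute value."""
    k = int(len(weights) * fraction)
    if k == 0:
        return list(weights)
    # Indices of the k smallest-magnitude weights.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

def quantize_int8(weights):
    """Symmetric linear quantization to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid dividing by zero
    scale = max_abs / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Map the integers back to approximate floating-point weights."""
    return [v * scale for v in quantized]

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(1000)]  # stand-in for a layer

pruned = prune_by_magnitude(weights, 0.4)   # drop about 40% of the weights
q, scale = quantize_int8(pruned)            # store the rest as small integers
restored = dequantize(q, scale)

zeros = sum(1 for w in pruned if w == 0.0)
max_err = max(abs(a - b) for a, b in zip(pruned, restored))
print(f"zeroed weights: {zeros} of {len(weights)}")
print(f"worst rounding error: {max_err:.5f} (step size {scale:.5f})")
```

Note that the pruned list is still the same length: the zeros are stored and multiplied like any other value, which is why this kind of pruning does not speed anything up unless the runtime has fast sparse kernels.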


viraptor 875 days ago

For models with dropped weights, the keyword is "distilled". For example, SSD-1B is a version of Stable Diffusion XL at 50% of the size (https://huggingface.co/segmind/SSD-1B).


sp332 875 days ago

That’s crazy, I’ve never seen one that dropped whole layers from a pre-trained model. I guess that avoids the sparse matrix math.
