HN Mirror

Hacker News

by Piezoid 1462 days ago

I can think of many specialized applications where that versatility is superfluous, while the size of the model prohibits inference on the edge.

Do you know if there are methods available for shrinking a fine-tuned derivative of such big models?

Besides generating a specialized corpus with the big model and then training a smaller model on it, is there a more direct way to reduce the matrix dimensions while optimizing for a more specific inference problem? How far can we scale down before a different network topology is needed?

2 comments

f38zf5vdt 1462 days ago

You can quantize the model to 8-bit integer tensors instead of 16-bit (bfloat16/float16) or 32-bit floats. NVIDIA has dedicated hardware in their latest series of GPUs for fast inference with 8-bit quantization, and the quantized model takes 1/2 (from 16-bit) to 1/4 (from 32-bit) of the original memory. There are other tricks too, like sparse tensors, which have been applied to language models and can reduce the memory overhead 10-100x.
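To make the 8-bit idea concrete, here is a minimal sketch of symmetric per-tensor quantization in plain Python. The function names (`quantize_int8`, `dequantize`) and the toy weights are made up for illustration; real toolchains typically quantize per channel and calibrate on data, so treat this as the basic arithmetic only.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: floats -> (ints in [-127, 127], scale)."""
    max_abs = max(abs(w) for w in weights)
    # One float scale is stored per tensor; guard against an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the 8-bit integers back to approximate float values."""
    return [v * scale for v in q]


# Hypothetical toy weights, not taken from any real model.
weights = [0.5, -1.27, 0.0, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding error per weight is at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight now needs one byte instead of two or four, plus a single shared scale, which is where the 1/2 to 1/4 memory figure comes from.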

See also: "From Dense to Sparse: Contrastive Pruning for Better Pre-trained Language Model Compression"
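For a rough feel of how sparsity saves memory, here is a sketch of plain magnitude pruning followed by sparse storage. Note that this is not the contrastive pruning method from the paper above, just the simplest baseline; `magnitude_prune`, `to_sparse`, and the example values are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def to_sparse(weights):
    """Keep only the nonzero entries as an {index: value} mapping."""
    return {i: w for i, w in enumerate(weights) if w != 0.0}


# Hypothetical toy weights; prune half of them.
dense = [0.02, -0.9, 0.001, 0.4, -0.03, 0.7, 0.0005, -0.01]
pruned = magnitude_prune(dense, sparsity=0.5)
sparse = to_sparse(pruned)
print(pruned)
print(sparse)
```

In practice the model is usually fine-tuned after pruning to recover accuracy, and the actual savings depend on how the sparse format stores indices.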

fishingboy 1462 days ago

As far as I know, there are many ways to compress a model, such as quantization, pruning, and knowledge distillation.
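Of the three, knowledge distillation is closest to the "train a smaller model on the big model's output" idea from the question. Below is a minimal sketch of the usual soft-target loss: the KL divergence between temperature-softened teacher and student distributions, scaled by T^2. The function names and logits are illustrative assumptions, not BMCook's API.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Hypothetical logits for a 3-class problem.
teacher = [4.0, 1.0, 0.5]
student = [2.5, 1.5, 0.2]
loss_mismatch = distillation_loss(teacher, student)
loss_match = distillation_loss(teacher, teacher)
print(loss_mismatch, loss_match)
```

The loss is zero when the student reproduces the teacher's distribution and grows as they diverge; in a real setup it is typically combined with the ordinary task loss on labeled data.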

By the way, while browsing the OpenBMB repo I found a package called BMCook, which implements several of these algorithms and also compares them with other model compression packages. Hope this helps.