| HN Mirror



	by orost 1162 days ago
	Quantization isn't (and wasn't) expensive; it's mostly just data shuffling. A good PC will quantize a 7B model in about half a minute, and a larger model in up to a few minutes. Quantized models being made available for download is more for the benefit of less technical users who may not be comfortable with the command-line tools, or for people with slow or metered connections who'd much rather download 15 GB of data than download 60 GB only to squish it down to 15 GB.
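
To illustrate why plain quantization is cheap, here is a minimal sketch of block-wise round-to-nearest 4-bit quantization. It is an assumption for illustration only, not the code of any particular tool: the function names, the block size of 64, and the symmetric -7..7 integer range are all hypothetical choices. The point is that each weight is touched once with a divide, a round, and a clamp.

```python
def quantize_absmax_4bit(weights, block_size=64):
    """Round-to-nearest 4-bit quantization with one absmax scale per block.

    Hypothetical helper for illustration; returns integer codes in -7..7
    and one floating-point scale per block.
    """
    codes, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        # Fall back to 1.0 so an all-zero block does not divide by zero.
        absmax = max(abs(w) for w in block) or 1.0
        scale = absmax / 7.0
        scales.append(scale)
        # Round each weight to the nearest level, then clamp to the 4-bit range.
        codes.extend(max(-7, min(7, round(w / scale))) for w in block)
    return codes, scales


def dequantize(codes, scales, block_size=64):
    """Map integer codes back to approximate weights."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]
```

The work is a single linear pass over the weights, which is consistent with the "half a minute for a 7B model" estimate being dominated by reading and writing the data.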


sp332 1162 days ago

The question is whether this step is actually doing GPTQ-optimized quantization or just simple truncation.


sanxiyn 1162 days ago

This work introduces a new quantization scheme, NF4 (4-bit NormalFloat), based on previous work on quantile quantization. So it's not simple truncation, but it's also not a GPTQ-like optimization method. Figure 3 of the paper shows the accuracy improvement of NF4 over FP4.
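
A rough sketch of the quantile quantization idea, to show how it differs from evenly spaced levels: the 16 levels are placed at quantiles of a standard normal distribution, so they are denser near zero where most weights sit. This is not the paper's exact NF4 table (as far as I know, that one is constructed so that zero is exactly representable); the function names and the bin-midpoint construction here are assumptions for illustration.

```python
from statistics import NormalDist


def normal_quantile_codebook(bits=4):
    """Build 2**bits levels at the midpoints of equal-probability bins of N(0, 1),
    rescaled so the outermost levels sit at roughly -1 and +1."""
    n = 2 ** bits
    nd = NormalDist()
    levels = [nd.inv_cdf((i + 0.5) / n) for i in range(n)]
    top = max(abs(v) for v in levels)
    return [v / top for v in levels]


def quantize_block(block, codebook):
    """Normalize a block by its absmax, then pick the nearest codebook index per weight."""
    absmax = max(abs(w) for w in block) or 1.0  # avoid dividing by zero
    idx = [
        min(range(len(codebook)), key=lambda k: abs(codebook[k] - w / absmax))
        for w in block
    ]
    return idx, absmax


def dequantize_block(idx, absmax, codebook):
    """Look up each index and undo the absmax normalization."""
    return [codebook[i] * absmax for i in idx]
```

Note there is still no per-layer optimization against calibration data, which is what would make it GPTQ-like; the only change from naive rounding is where the levels are placed.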
