by orost 1115 days ago
Quantization isn't (and wasn't) expensive; it's mostly just data shuffling. A good PC will do a 7B model in half a minute, and a larger model in up to a few minutes. Quantized models are made available for download mostly for the benefit of less technical users who may not be comfortable with the command-line tools, or for people with slow or metered connections who'd much rather download 15GB of data than download 60GB only to squish it into 15.
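To make the "data shuffling" point concrete, here is a minimal sketch of plain round-to-nearest 4-bit quantization over one block of weights. It is an illustration only, not what any particular tool does: the function names, the symmetric -7..7 code range, and the two-codes-per-byte packing are all assumptions made for the example.

```python
def rtn_quantize_block(weights, bits=4):
    """Round-to-nearest quantization of one block (hypothetical helper)."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4 bits
    absmax = max(abs(w) for w in weights) or 1.0    # avoid dividing by zero
    scale = absmax / qmax                           # one float stored per block
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def pack_nibbles(codes):
    """Pack two signed 4-bit codes into each byte (assumed storage layout)."""
    if len(codes) % 2:
        codes = codes + [0]                         # pad to an even count
    # Shift -7..7 up to 1..15 so each code fits in an unsigned nibble.
    return bytes(((a + 8) << 4) | (b + 8) for a, b in zip(codes[0::2], codes[1::2]))


def rtn_dequantize_block(codes, scale):
    """Recover approximate weights from the integer codes."""
    return [c * scale for c in codes]


block = [0.5, -1.0, 0.25, 0.0]
codes, scale = rtn_quantize_block(block)
packed = pack_nibbles(codes)
restored = rtn_dequantize_block(codes, scale)
```

Each weight is touched once: a divide, a round, and a bit shift. There is no optimization loop, which is why a pass like this is limited mostly by memory and disk speed.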

The question is whether this step is actually doing GPTQ-style optimized quantization or simple truncation.
This work introduces a new quantization scheme, NF4 (4-bit NormalFloat), based on previous work on quantile quantization. So it's not simple truncation, but it's also not a GPTQ-like optimization method. Figure 3 of the paper shows the accuracy improvement of NF4 over FP4.
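For a rough feel of what quantile quantization means, here is a simplified sketch: the 16 code values are placed at quantiles of a standard normal distribution instead of on an evenly spaced grid, and each weight is mapped to the nearest one. This is not the NF4 table from the paper, whose construction differs in its details. The bin-midpoint choice, the per-block absmax scaling, and all function names below are assumptions for illustration.

```python
from statistics import NormalDist


def normal_quantile_levels(n_levels=16):
    """Code values at the midpoints of equal-probability bins of N(0, 1),
    rescaled to span [-1, 1]. A simplification, not the paper's NF4 table."""
    nd = NormalDist()
    qs = [nd.inv_cdf((i + 0.5) / n_levels) for i in range(n_levels)]
    top = max(abs(q) for q in qs)
    return [q / top for q in qs]


def quantile_quantize_block(weights, levels):
    """Scale the block by its absmax, then pick the nearest level per weight."""
    absmax = max(abs(w) for w in weights) or 1.0    # avoid dividing by zero
    codes = [
        min(range(len(levels)), key=lambda i: abs(levels[i] - w / absmax))
        for w in weights
    ]
    return codes, absmax


def quantile_dequantize_block(codes, absmax, levels):
    """Look each code up in the table and undo the block scaling."""
    return [levels[c] * absmax for c in codes]


levels = normal_quantile_levels()
weights = [0.03, -0.4, 2.0, 0.1, -0.02]
q_codes, q_absmax = quantile_quantize_block(weights, levels)
approx = quantile_dequantize_block(q_codes, q_absmax, levels)
```

The levels end up packed closely around zero and spread out toward the tails, matching where roughly normally distributed weights actually sit. It is still a single table lookup per weight, with no per-layer error minimization of the kind GPTQ does.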