| There are two forms of compression relevant to LLMs: 1. Reduce the number of parameters 2. Reduce the resolution of each parameter (quantization) For 1, changing the architecture is typically only possible by the labs producing the models, which is why each OSS model release tends to feature a small number of carefully chosen model sizes (for example, Gemma4 comes in e2B, e4B, 12B, 26Ba4B, and 31B sizes). Generally, models with higher parameter counts have more world knowledge. For coding models, this shows up as a stronger command of uncommon libraries/languages. Very small models (<20B) also lack “smarts.” Reducing the resolution of each parameter is easier which is why lots of practitioners have their own quantizations, but this makes it harder for a model to “think” fluently. Interacting with heavily quantized models feels like interacting with someone who didn’t get any sleep the night before. Models that have higher-fidelity quantization take more RAM and have higher “smarts,” but don’t necessarily have more world knowledge. Models with aggressive quantization tend to be more likely to make rookie mistakes, emit malformed tool calls, get stuck in loops, or even exhibit signs of “neuroticism” / “distress” in their thinking tokens. Parameter counts = world knowledge, quantization = “smarts.” This is a soft rule of thumb, the difference isn’t very strong. |