I think distillation in the original sense isn't being done anymore, but finetuning on outputs from larger models like GPT-4 is a form of distillation (top-1 logit vs. all logits, and curated synthetic data instead of the original dataset).
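To make the "top-1 logit vs. all logits" distinction concrete, here is a minimal sketch in plain Python. The logits, the temperature of 2.0, and the helper names (`softmax`, `cross_entropy`) are illustrative assumptions, not anything taken from a real model or library. It compares a classic soft-target distillation loss with the one-hot loss you effectively get when finetuning on a larger model's output text.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, student_logits, temperature=1.0):
    """Cross-entropy between a target distribution and the student's softmax."""
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s)
                for t, s in zip(target_probs, student_probs) if t > 0)

# Hypothetical logits over a 3-token vocabulary (made up for illustration).
teacher_logits = [2.0, 1.5, -1.0]
student_logits = [1.0, 1.2, 0.0]

# Classic distillation: the student matches the teacher's full distribution.
soft_targets = softmax(teacher_logits, temperature=2.0)
soft_loss = cross_entropy(soft_targets, student_logits, temperature=2.0)

# Finetuning on generated text: only the emitted token is visible,
# so the target collapses to a one-hot vector (top-1 here).
top1 = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
hard_targets = [1.0 if i == top1 else 0.0 for i in range(len(teacher_logits))]
hard_loss = cross_entropy(hard_targets, student_logits)

print(f"soft-target loss: {soft_loss:.4f}")
print(f"hard-target loss: {hard_loss:.4f}")
```

The soft targets keep the information that the second token was nearly as likely as the first; the one-hot target throws that away, which is roughly what is lost when only the generated text is available.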
On quantization, though, it's still weird that only the weights are quantized in methods like GPTQ / int8, while other methods quantize the activations as well. There's also the matter of the KV cache staying in its original 16-bit precision regardless, which is also unsolved here. Do you have any thoughts or insights into this?
It’s not clear to me what’s happening on the distillation front. I agree no one is doing it externally, but I suspect that the foundation model companies are doing it internally; performance is just too good.
There’s a bunch of recent work that quantizes the activations as well, like FP8-LM. I think that this will come. Quantization support in PyTorch is pretty experimental right now, so I think we’ll see a lot of improvements as it gets better support.
The KV cache piece is tied to the activations, imo: once those start getting quantized effectively, the KV cache will follow.
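As a rough illustration of the weight-only vs. weight-plus-activation distinction, here is a toy sketch in plain Python. It uses simple symmetric per-tensor int8 rounding, which is an assumption made for brevity; it is not how GPTQ or FP8-LM work, and the helper names and numbers are made up.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map quantized integers back to approximate float values."""
    return [x * scale for x in q]

# Hypothetical weights and activations for a single dot product.
weights = [0.5, -1.0, 0.25, 0.75]
activations = [2.0, 0.1, -0.3, 4.0]
y_ref = sum(w * a for w, a in zip(weights, activations))

# Weight-only: weights are stored as int8 but dequantized before the
# matmul, so the arithmetic still runs on float activations.
q_w, s_w = quantize_int8(weights)
w_hat = dequantize(q_w, s_w)
y_weight_only = sum(w * a for w, a in zip(w_hat, activations))

# Weights and activations: both are int8, the accumulation is done in
# integers, and the result is rescaled once at the end.
q_a, s_a = quantize_int8(activations)
acc = sum(w * a for w, a in zip(q_w, q_a))
y_w8a8 = acc * s_w * s_a

print(f"reference:   {y_ref:.4f}")
print(f"weight-only: {y_weight_only:.4f}")
print(f"w8a8:        {y_w8a8:.4f}")
```

Cached keys and values are themselves activations, so in this toy picture the same `quantize_int8` step would apply to them as well, which is the sense in which the KV cache follows activation quantization.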
1) I actually think that’s too high; I bet it’s more like 30%. My logic is that they have to have _some_ margin, but LLMs are too expensive to have typical software margins. Total speculation though.
2) It generally tracks pretty well unless the model is gaming the metric (training on the test set, overfitting to the specific source of data, etc.). The relative rankings will typically match in both.
3) Alas, not with the mild winter North America’s having. They only stop below -5C or so. I am lucky though. The woodpecker stopped attacking my house and started attacking my neighbor’s. Even worse, it used to be a downy woodpecker, and it’s now been replaced by a pileated one (think: Woody).