by grungegun 815 days ago
Does anyone know if this works on vanilla deep networks? These quantization articles always seem to target LLMs, which leads me to wonder whether there's something special about the LLM architecture vs. a vanilla deep architecture.

Transformer LLMs are just a bunch of MLPs (linear layers) where you sometimes multiply/softmax the output in a funny way (attention). In other words, they're arguably more "vanilla deep net" than most architectures (e.g., conv nets).

(There are also positional/token embeddings and normalization, but those are a tiny minority of the parameters.)
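
To put a rough number on "tiny minority", here is a back-of-the-envelope count for a GPT-style decoder. Every size in it is a hypothetical value picked for illustration, not taken from any particular model, and biases are ignored:

```python
# Rough parameter count for a GPT-style decoder-only transformer.
# All sizes below are hypothetical, chosen only for illustration.
d_model = 4096      # hidden width
n_layers = 32       # number of transformer blocks
vocab_size = 32000  # token vocabulary
context_len = 2048  # learned positional embeddings (assumed)
mlp_ratio = 4       # MLP hidden width = mlp_ratio * d_model (assumed)

# Per block: Q, K, V and output projections are four d_model x d_model matrices.
attention_params = 4 * d_model * d_model
# Per block: MLP up-projection and down-projection.
mlp_params = 2 * mlp_ratio * d_model * d_model
# Per block: two normalization layers, each with a scale and a shift vector.
norm_params = 2 * 2 * d_model

linear_params = n_layers * (attention_params + mlp_params)
other_params = (
    n_layers * norm_params
    + vocab_size * d_model      # token embeddings
    + context_len * d_model     # positional embeddings
)
total_params = linear_params + other_params
linear_fraction = linear_params / total_params

print(f"linear layers: {linear_params / 1e9:.2f}B parameters")
print(f"embeddings + norms: {other_params / 1e9:.3f}B parameters")
print(f"share held by linear layers: {linear_fraction:.1%}")
```

With these assumed sizes, the linear layers hold well over 95% of the parameters, so a scheme that quantizes plain linear layers covers nearly the whole model.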

So there's no performance gain for quantization enabled by the transformer architecture? It seems very strange that quantization works so well, since in most of my experiments the internal weights of MLPs look random.
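
One hedged observation: weights that look random can still survive rounding, because each rounding error is small relative to the weights and the errors tend to average out in a dot product. A minimal sketch, assuming Gaussian weights and symmetric absmax int8 quantization (both are my assumptions, not something from the article; `quantize_int8` and `dequantize` are hypothetical helpers):

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

def quantize_int8(weights):
    """Symmetric round-to-nearest quantization to integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(q_weights, scale):
    """Map the integer codes back to approximate float weights."""
    return [q * scale for q in q_weights]

n = 4096
weights = [random.gauss(0.0, 1.0) for _ in range(n)]  # "random-looking" weights
inputs = [random.gauss(0.0, 1.0) for _ in range(n)]

q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)

# Compare the rounding error to the typical weight magnitude.
rms_weight = (sum(w * w for w in weights) / n) ** 0.5
rms_error = (sum((w - r) ** 2 for w, r in zip(weights, restored)) / n) ** 0.5
relative_error = rms_error / rms_weight

# Compare one neuron's pre-activation before and after quantization.
exact_dot = sum(w * x for w, x in zip(weights, inputs))
quant_dot = sum(r * x for r, x in zip(restored, inputs))

print(f"relative RMS weight error: {relative_error:.4f}")
print(f"dot product: exact {exact_dot:.3f}, quantized {quant_dot:.3f}")
```

This says nothing about lower bit widths or about trained weights with outliers; it only shows that "looks random" and "tolerates 8-bit rounding" are not in tension.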
Ok, but what does a perceptron look like in 1-bit? Would it be just some simple logic gate, like an OR-gate?
Not my area of expertise but I'd assume it becomes a decision tree or something.

Edit: lol https://news.ycombinator.com/item?id=39868508
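
For what it's worth, a single two-input unit with 1-bit weights does reduce to a simple gate. A minimal sketch, assuming inputs in {0, 1}, weights in {-1, +1}, and a threshold that is kept as a small integer rather than 1 bit (all of that is my assumption; `binary_perceptron` and `truth_table` are hypothetical helpers):

```python
from itertools import product

def binary_perceptron(weights, threshold):
    """Perceptron with weights in {-1, +1}; fires when the weighted sum reaches the threshold."""
    def fire(inputs):
        total = sum(w * x for w, x in zip(weights, inputs))
        return int(total >= threshold)
    return fire

def truth_table(gate, n_inputs=2):
    """Evaluate the gate on every combination of 0/1 inputs."""
    return {bits: gate(bits) for bits in product((0, 1), repeat=n_inputs)}

or_gate = binary_perceptron(weights=(+1, +1), threshold=1)
and_gate = binary_perceptron(weights=(+1, +1), threshold=2)
# A negative weight gives "x0 AND NOT x1".
and_not_gate = binary_perceptron(weights=(+1, -1), threshold=1)

for name, gate in [("OR", or_gate), ("AND", and_gate), ("x0 AND NOT x1", and_not_gate)]:
    print(name, truth_table(gate))
```

So the same weights give OR or AND depending on the threshold. As with any single perceptron, XOR is out of reach without a second layer.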

LLMs have been trending towards obscenely large numbers of parameters (314B for Grok), which makes quantization crucial if you want to run them without a Meta-sized budget.
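
The arithmetic for the 314B figure, as a rough sketch. It counts weight storage only and ignores activations, the KV cache, and any per-group scale factors a real quantization format would add (`weight_memory_gb` is a hypothetical helper):

```python
GROK_PARAMS = 314e9  # parameter count quoted in the comment above

def weight_memory_gb(n_params, bits_per_weight):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

for bits in (16, 8, 4, 1):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(GROK_PARAMS, bits):,.2f} GB")
```

That is 628 GB at 16 bits versus 157 GB at 4 bits, which is the difference between a multi-node cluster and a single well-equipped server.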
It certainly does; people have been doing this in computer vision for years.