HN Mirror


	by grungegun 815 days ago
	Does anyone know if this works on vanilla deep networks? These quantization articles always seem to target LLMs, which leads me to wonder if there's something special about the LLM architecture vs. a vanilla deep architecture.

3 comments

zaptrem 815 days ago

Transformer LLMs are just a bunch of MLPs (linear layers) where you sometimes multiply/softmax the output in a funny way (attention). In other words, they're arguably more "vanilla deep net" than most architectures (e.g., conv nets).

(There are also positional/token embeddings and normalization, but those are a tiny minority of the parameters.)
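
To put a rough number on "tiny minority", here is a back-of-the-envelope sketch. Everything in it is an assumption made for illustration, not a description of any particular model: the function name `transformer_param_breakdown`, the configuration values, no biases, an MLP hidden width of 4x the model width, gain-only norms, and no learned positional embeddings.

```python
def transformer_param_breakdown(d_model, n_layers, vocab_size, ff_mult=4):
    """Rough parameter counts for a hypothetical decoder-only transformer.

    Assumptions (not from the thread): no biases, Q/K/V/output projections
    of size d_model x d_model, a two-matrix MLP with hidden width
    ff_mult * d_model, two gain-only norms per layer, one token embedding
    table, and no learned positional embeddings.
    """
    attention = 4 * d_model * d_model          # Q, K, V, and output projections
    mlp = 2 * d_model * (ff_mult * d_model)    # up and down projections
    linear = n_layers * (attention + mlp)      # all linear-layer weights
    norms = n_layers * 2 * d_model             # two gain vectors per layer
    embeddings = vocab_size * d_model          # token embedding table
    total = linear + norms + embeddings
    return {"linear": linear, "norms": norms,
            "embeddings": embeddings, "total": total}


# Hypothetical configuration, chosen only for illustration.
counts = transformer_param_breakdown(d_model=4096, n_layers=32, vocab_size=32000)
linear_share = counts["linear"] / counts["total"]
print(f"linear layers: {linear_share:.1%} of parameters")
```

Under these assumptions the linear layers hold well over 95% of the weights. The share shrinks for small models with large vocabularies, since the embedding table grows linearly with width while the linear layers grow quadratically.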

grungegun 815 days ago

So there's no performance gain for quantization enabled by the transformer architecture? It seems very strange that quantization works so well since, in most of my experiments, the internal model weights of MLPs look random.

amelius 815 days ago

Ok, but what does a perceptron look like in 1-bit? Would it be just some simple logic gate, like an OR-gate?

zaptrem 815 days ago

Not my area of expertise, but I'd assume it becomes a decision tree or something.

Edit: lol https://news.ycombinator.com/item?id=39868508
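
One way to make the 1-bit perceptron question concrete, as a sketch rather than a claim about how any real 1-bit scheme works: restrict each weight to -1 or +1, feed it 0/1 inputs, and fire when the weighted sum reaches a threshold. The helper names `binary_perceptron` and `truth_table` are made up for this example, and keeping an integer threshold alongside the 1-bit weights is an assumption.

```python
from itertools import product


def binary_perceptron(weights, threshold):
    """Build a unit that outputs 1 when sum(w * x) >= threshold, else 0.

    Each weight must be -1 or +1 (one bit of information); inputs are 0 or 1.
    """
    if any(w not in (-1, 1) for w in weights):
        raise ValueError("weights must be -1 or +1")

    def unit(inputs):
        total = sum(w * x for w, x in zip(weights, inputs))
        return int(total >= threshold)

    return unit


def truth_table(unit, n_inputs):
    """Evaluate the unit on every 0/1 input combination."""
    return {bits: unit(bits) for bits in product((0, 1), repeat=n_inputs)}


or_gate = binary_perceptron([1, 1], threshold=1)    # fires if any input is 1
and_gate = binary_perceptron([1, 1], threshold=2)   # fires only if both are 1

print(truth_table(or_gate, 2))
print(truth_table(and_gate, 2))
```

So an OR gate is one instance, and AND is the same weights with a higher threshold. More generally a unit like this is a threshold gate over its inputs, which can only represent linearly separable functions, so a single unit cannot compute XOR.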

alephxyz 815 days ago

LLMs have been trending towards an obscenely large number of parameters (314B for Grok), which makes quantization crucial if you want to run them without a Meta-sized budget.
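
The arithmetic behind that is easy to sketch. The snippet below counts weight storage only, using the 314B figure quoted above; it ignores activations, the KV cache, and any per-group scale factors a real quantization format would add. The function name `weight_memory_gb` is made up for this example.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


GROK_PARAMS = 314_000_000_000  # parameter count quoted in the comment above

for bits in (16, 8, 4, 1):
    gb = weight_memory_gb(GROK_PARAMS, bits)
    print(f"{bits:>2}-bit weights: about {gb:.2f} GB")
```

At 16 bits per weight that is 628 GB just for the weights; at 4 bits it drops to 157 GB, and at 1 bit to under 40 GB.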

Y_Y 815 days ago

It certainly does; people have been doing this in computer vision for years.
