by zaptrem 815 days ago
Transformer LLMs are mostly just stacks of MLPs (linear layers with a nonlinearity in between) where you sometimes multiply/softmax the output in a funny way (attention). In other words, they're arguably more "vanilla deep net" than most architectures (e.g., conv nets).

(There are also positional/token embeddings and normalization, but those are a tiny minority of the parameters.)
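
To make the "linear layers plus a funny multiply/softmax" description above concrete, here is a rough NumPy sketch of one transformer block. It is a toy under stated assumptions, not any particular model's implementation: single head, no causal mask, no biases, no normalization, ReLU in the MLP, and made-up sizes (`d`, `seq`). All names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, Wq, Wk, Wv, Wo):
    # Three linear projections, then the "funny" part:
    # multiply queries by keys, softmax, and use the result to mix values.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return (weights @ v) @ Wo

def mlp(x, W1, W2):
    # Plain two-layer MLP: linear, ReLU, linear.
    return np.maximum(x @ W1, 0.0) @ W2

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
d, seq = 8, 4
x = rng.normal(size=(seq, d))
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) * 0.1 for _ in range(4))
W1 = rng.normal(size=(d, 4 * d)) * 0.1
W2 = rng.normal(size=(4 * d, d)) * 0.1

# One block with residual connections (normalization omitted).
h = x + attention(x, Wq, Wk, Wv, Wo)
out = h + mlp(h, W1, W2)

# Every parameter in this sketch lives in a plain weight matrix.
n_linear_params = sum(W.size for W in (Wq, Wk, Wv, Wo, W1, W2))
```

With these toy sizes, the six weight matrices hold 12 * d * d parameters; a normalization layer would add only on the order of d more, which is the "tiny minority" point.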

2 comments

So the transformer architecture itself doesn't enable any of the performance gain from quantization? It seems very strange that quantization works so well, since in most of my experiments the internal weights of MLPs look random.
Ok, but what does a perceptron look like in 1-bit? Would it be just some simple logic gate, like an OR-gate?
Not my area of expertise but I'd assume it becomes a decision tree or something.

Edit: lol https://news.ycombinator.com/item?id=39868508
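
On the 1-bit perceptron question above, here is a small sketch of one possible reading. The assumptions are mine, not from the thread or the linked post: two inputs in {0, 1}, weights restricted to {-1, +1}, and an integer threshold kept at full precision. Under those assumptions a single unit does behave like a simple gate such as OR or AND, and (as with any single perceptron) it cannot produce XOR.

```python
from itertools import product

def perceptron(weights, threshold, inputs):
    # Fires (returns 1) when the weighted sum reaches the threshold.
    total = sum(w * x for w, x in zip(weights, inputs))
    return int(total >= threshold)

# All input pairs, in the order (0,0), (0,1), (1,0), (1,1).
INPUTS = list(product([0, 1], repeat=2))

def truth_table(weights, threshold):
    return tuple(perceptron(weights, threshold, x) for x in INPUTS)

# Assumed "1-bit" setup: each weight is either -1 or +1.
OR_TABLE = truth_table((1, 1), 1)    # fires if at least one input is 1
AND_TABLE = truth_table((1, 1), 2)   # fires only if both inputs are 1

# Enumerate every truth table a single unit can realize under these assumptions.
reachable = {
    truth_table(w, t)
    for w in product([-1, 1], repeat=2)
    for t in range(-2, 4)
}

XOR_TABLE = (0, 1, 1, 0)
xor_is_reachable = XOR_TABLE in reachable
```

So in this reading the answer is roughly "yes, a simple gate", with the choice of gate set by the weight signs and the threshold; richer behavior would have to come from stacking many such units.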