by zaptrem
815 days ago
Transformer LLMs are just a bunch of MLPs (linear layers with a nonlinearity in between) where you sometimes multiply and softmax the activations in a funny way (attention). In other words, they're arguably more "vanilla deep net" than most architectures (e.g., conv nets). (There are also positional/token embeddings and normalization layers, but in large models those are a small minority of the parameters.)
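
To make that concrete, here is a rough NumPy sketch of one pre-norm, single-head decoder block plus a back-of-the-envelope parameter count. Everything in it is an assumption for illustration: the names (`block`, `init_params`, `param_fractions`), the 4x MLP width, the ReLU, the omission of biases, and the example sizes are hypothetical and not taken from any particular model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalization without learned scale/shift, to keep the sketch short.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def init_params(d_model, rng):
    # Six weight matrices per block: four for attention, two for the MLP.
    def w(rows, cols):
        return rng.normal(0.0, 0.02, size=(rows, cols))
    return {
        "wq": w(d_model, d_model), "wk": w(d_model, d_model),
        "wv": w(d_model, d_model), "wo": w(d_model, d_model),
        "w1": w(d_model, 4 * d_model), "w2": w(4 * d_model, d_model),
    }

def block(x, p):
    # x has shape (seq_len, d_model).
    h = layer_norm(x)
    # Attention: three linear layers, a matrix product, a softmax, one more linear layer.
    q, k, v = h @ p["wq"], h @ p["wk"], h @ p["wv"]
    scores = q @ k.T / np.sqrt(q.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)  # causal mask: no peeking ahead
    x = x + softmax(scores) @ v @ p["wo"]
    # MLP: linear, ReLU, linear.
    h = layer_norm(x)
    return x + np.maximum(h @ p["w1"], 0.0) @ p["w2"]

def param_fractions(d_model, n_layers, vocab, ctx):
    # Rough counts, ignoring biases; norm assumes a learned scale and shift
    # for each of the two norms per block.
    counts = {
        "attention": n_layers * 4 * d_model**2,
        "mlp": n_layers * 8 * d_model**2,
        "embeddings": (vocab + ctx) * d_model,
        "norm": n_layers * 2 * 2 * d_model,
    }
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(5, 16))
    print(block(x, init_params(16, rng)).shape)
    # Hypothetical "large" and "small" configurations.
    print(param_fractions(d_model=4096, n_layers=32, vocab=32000, ctx=2048))
    print(param_fractions(d_model=512, n_layers=12, vocab=32000, ctx=1024))
```

Under these assumptions the MLP and attention matrices dominate the large configuration, while in the small one the token embedding table takes a much bigger share, which is why the "small minority" remark is best read as a statement about large models.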