|
|
|
|
|
by tehsauce
2176 days ago
|
|
2048 neurons per layer isn't really an accurate description, what he means is 2048 dimensional embeddings at each layer. The actual multihead attention layers in a transformer are not just feed forward 2048*2048, but actually have many more parameters. That's why there's 600B total. |
|