by tehsauce 2176 days ago
"2048 neurons per layer" isn't really an accurate description; what he means is 2048-dimensional embeddings at each layer. The multi-head attention layers in a transformer aren't just a single 2048*2048 feed-forward matrix. Each one has several weight matrices, so it carries many more parameters than that. That's why there are 600B in total.
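To make the per-layer arithmetic concrete, here's a rough sketch of a parameter count for one textbook transformer layer at d_model = 2048. It assumes separate Q, K, V and output projections in the attention block and a feed-forward block with a 4x hidden width, and it ignores biases and layer norms. Those are my assumptions for illustration, not details of the model being discussed, and the function name is made up. It doesn't try to reproduce the 600B figure, which depends on the layer count and other details not given here.

```python
def transformer_layer_params(d_model: int, ffn_mult: int = 4) -> dict:
    """Approximate weight count for one standard transformer layer.

    Assumptions (illustrative only): Q, K, V and output projections are
    each d_model x d_model; the feed-forward block has two matrices with
    hidden width ffn_mult * d_model; biases and layer norms are ignored.
    """
    single_matrix = d_model * d_model              # the naive "2048*2048" view
    attention = 4 * d_model * d_model              # Q, K, V, output projections
    feed_forward = 2 * d_model * (ffn_mult * d_model)  # up- and down-projection
    return {
        "single_matrix": single_matrix,
        "attention": attention,
        "feed_forward": feed_forward,
        "total": attention + feed_forward,
    }


counts = transformer_layer_params(2048)
for name, value in counts.items():
    print(f"{name:>13}: {value:,}")
```

Under those assumptions a single layer holds about 12x the weights of one 2048*2048 matrix, before multiplying by the number of layers.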