| HN Mirror



	by tehsauce 2176 days ago
	"2048 neurons per layer" isn't really an accurate description; what he means is 2048-dimensional embeddings at each layer. The actual multi-head attention layers in a transformer are not just a single 2048*2048 feed-forward matrix, but have many more parameters. That's why there are 600B in total.
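
	To make the gap concrete, here is a rough per-layer count. This is only a sketch under assumptions that are not stated in the thread: a standard dense transformer layer, a feed-forward block with 4x expansion, and biases and layer norms ignored. The function name and the `ffn_mult` parameter are made up for illustration. It does not try to reproduce the 600B total, which depends on details of the model not given here.

```python
def transformer_layer_params(d_model: int, ffn_mult: int = 4) -> dict:
    """Approximate weight count for one standard transformer layer.

    Assumptions (not from the original comment): dense layer,
    feed-forward hidden size = ffn_mult * d_model, no biases or layer norms.
    """
    # Q, K, V and output projections: four d_model x d_model matrices.
    # The number of heads does not change this when the head sizes sum to d_model.
    attention = 4 * d_model * d_model
    # Feed-forward block: d_model -> ffn_mult * d_model -> d_model.
    feed_forward = 2 * d_model * (ffn_mult * d_model)
    return {
        "attention": attention,
        "feed_forward": feed_forward,
        "total": attention + feed_forward,
    }


naive = 2048 * 2048  # the "2048 neurons per layer" reading
counts = transformer_layer_params(2048)

print(naive)                     # 4194304
print(counts["attention"])       # 16777216
print(counts["feed_forward"])    # 33554432
print(counts["total"])           # 50331648
print(counts["total"] // naive)  # 12
```

	Under those assumptions, one layer holds about 12x the weights of a single 2048*2048 matrix, before counting embeddings or any additional layers.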