HN Mirror



	by rrobukef 1061 days ago
	I also noticed that the author distinguishes the internal softmax from the output softmax. I think he'd apply his modification only to the internal softmax and leave the output softmax alone, so that it still forces an output.

1 comment

cs702 1060 days ago

Yes, it makes sense to apply this only to the Softmax we use to compute attention. It makes no sense to apply it to the output Softmax, which must compute a probability distribution over the vocabulary.
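
To make the distinction concrete, here is a minimal sketch in plain Python. It assumes the modification under discussion is the variant that adds 1 to the softmax denominator; the thread itself does not spell out the formula, and the name `softmax_plus_one` is made up for illustration.

```python
import math

def softmax(scores):
    """Standard softmax: the outputs always sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_plus_one(scores):
    """Assumed variant: exp(x_i) / (1 + sum_j exp(x_j)).

    The extra 1 behaves like an implicit score of 0, so the outputs
    can sum to less than 1 when every real score is very negative.
    """
    m = max(max(scores), 0.0)  # include the implicit 0 in the max
    exps = [math.exp(s - m) for s in scores]
    total = math.exp(-m) + sum(exps)  # exp(-m) is the rescaled "+1"
    return [e / total for e in exps]

# Attention-style scores where nothing is relevant.
weak_scores = [-10.0, -10.0, -10.0]
print(sum(softmax(weak_scores)))           # forced to sum to 1
print(sum(softmax_plus_one(weak_scores)))  # allowed to stay near 0
```

With uniformly weak scores, the standard version still hands out a full unit of weight, while the assumed variant lets the total shrink toward zero. That is plausibly useful for an attention head that wants to attend to nothing, but it would be wrong for the output layer, where the weights over the vocabulary have to sum to 1.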
