HN Mirror

	by cs702 1061 days ago
	Yes, it makes sense to apply this only to the Softmax we use to compute attention. It makes no sense to apply it to the output Softmax, which must compute a probability distribution over the vocabulary.
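A rough sketch of why, assuming the variant under discussion is a Softmax with an extra 1 in the denominator, i.e. exp(x_i) / (1 + sum_j exp(x_j)). That reading, and the function name `softmax_plus_one`, are assumptions for illustration and not something stated in the comment. With the extra term the outputs sum to less than 1, which may be fine for attention weights but cannot serve as a distribution over the vocabulary:

```python
import math

def softmax(xs):
    # Standard softmax, shifted by the max for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_plus_one(xs):
    # Hypothetical variant (assumption): exp(x_i) / (1 + sum_j exp(x_j)).
    # Shifting by m = max(max(xs), 0) turns the extra 1 into exp(-m),
    # so the result is unchanged but no exponent is positive.
    m = max(max(xs), 0.0)
    exps = [math.exp(x - m) for x in xs]
    total = math.exp(-m) + sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, -1.0]
print("softmax sum:         ", sum(softmax(logits)))
print("softmax_plus_one sum:", sum(softmax_plus_one(logits)))

# When every logit is very negative, the variant's outputs all shrink
# toward zero, while the standard softmax is still forced to sum to 1.
low = [-10.0, -10.0, -10.0]
print("variant on low logits:", softmax_plus_one(low))
```

Under this assumption, an attention head can assign almost no weight to any position, whereas an output layer built this way would leave part of the probability mass unassigned to any token.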