| HN Mirror

	by fpgaminer 990 days ago
	That was the first time I'd read about it on HN but, as pointed out on that HN post, it wasn't the first time Softmax + 1 was proposed. And, AFAIK, it has never resulted in better performance in practice. Maybe Softmax + 1 works better for fiddling with the attention window after training, but I don't know if anyone has tested that at scale.
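
For anyone who hasn't seen the formula: Softmax + 1 is usually written as exp(x_i) / (1 + sum_j exp(x_j)), i.e. ordinary softmax with an extra 1 in the denominator, so the outputs are allowed to sum to less than 1. Below is a minimal pure-Python sketch of that definition next to regular softmax. The function names are mine (not from any library), and this only illustrates the math; it says nothing about whether it helps in a real model.

```python
import math

def softmax(xs):
    # Standard softmax, shifted by the max for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_plus_one(xs):
    # Hypothetical helper name. Computes exp(x_i) / (1 + sum_j exp(x_j)).
    # Shifting by m = max(max(xs), 0) keeps every exponent <= 0; the "+1"
    # term becomes exp(-m) after the shift.
    m = max(max(xs), 0.0)
    exps = [math.exp(x - m) for x in xs]
    total = math.exp(-m) + sum(exps)
    return [e / total for e in exps]

scores = [-10.0, -10.0, -10.0]
print(sum(softmax(scores)))           # regular softmax always sums to ~1
print(sum(softmax_plus_one(scores)))  # can be close to 0 when all scores are very negative
```

The practical difference is the second case: with plain softmax the outputs must sum to 1 even when every score is low, while with the +1 variant they can all shrink toward zero. It is the same as appending a fixed zero score to the inputs and discarding its output.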