Hacker News
by fpgaminer 990 days ago
That was the first time I'd read about it on HN, but as pointed out in that HN thread, it wasn't the first time Softmax + 1 was proposed. And, AFAIK, it has never resulted in better performance in practice. Maybe Softmax + 1 works better for fiddling with the attention window after training, but I don't know if anyone has tested that at scale.
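
For anyone who hasn't seen it, here is a rough sketch of what I take Softmax + 1 to mean: ordinary softmax with an extra 1 in the denominator, so the weights can sum to less than 1 when every score is very negative. That reading is an assumption on my part, and the function names are made up for illustration; this is not taken from any particular library.

```python
import math


def softmax(scores):
    """Standard softmax: the outputs always sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_plus_one(scores):
    """Assumed definition: exp(x_i) / (1 + sum_j exp(x_j)).

    Equivalent to adding an implicit logit of 0 and discarding its
    weight, so the outputs can sum to less than 1.
    """
    m = max(max(scores), 0.0)  # include the implicit 0 logit in the max
    exps = [math.exp(s - m) for s in scores]
    total = math.exp(-m) + sum(exps)  # exp(-m) is the "+ 1" term after shifting
    return [e / total for e in exps]


if __name__ == "__main__":
    low = [-10.0, -10.0, -10.0]
    print(sum(softmax(low)))           # forced to distribute all of the weight
    print(sum(softmax_plus_one(low)))  # nearly zero total weight
```

With uniformly low scores, plain softmax still hands out all of the weight, while the + 1 version lets a head attend to almost nothing. Whether that helps in practice is exactly the open question.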