by fpgaminer
990 days ago
That was the first time I'd read about it on HN, but as pointed out on that HN post, it wasn't the first time Softmax + 1 was proposed. And, AFAIK, it has never resulted in better performance in practice. Maybe Softmax + 1 works better for fiddling with the attention window after training, but I don't know if anyone has tested that at scale.
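
For anyone unfamiliar with the idea, here is a minimal NumPy sketch of the difference. It assumes "Softmax + 1" means the variant with an extra 1 added to the denominator, so the weights can sum to less than 1; the function names are hypothetical and this is not taken from any particular implementation.

```python
import numpy as np

def softmax(x):
    # Standard softmax: subtract the max for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def softmax_plus_one(x):
    # Assumed definition: exp(x_i) / (1 + sum_j exp(x_j)).
    # Shift by m = max(max(x), 0) so neither exp(x - m) nor exp(-m) overflows;
    # the "+1" term becomes exp(-m) after the shift.
    m = max(np.max(x), 0.0)
    e = np.exp(x - m)
    return e / (np.exp(-m) + e.sum())

# Hypothetical attention scores that are all strongly negative.
scores = np.array([-4.0, -5.0, -6.0])
print(softmax(scores).sum())           # standard softmax always sums to 1
print(softmax_plus_one(scores).sum())  # can fall well below 1
```

With all-negative scores, the standard version is still forced to distribute a full unit of attention, while the "+ 1" version lets the total shrink toward zero. Whether that helps in practice is exactly the open question above.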