by sidkshatriya 505 days ago
There is a hyperparameter `s` in scalable softmax.

SSoftMax_i = exp(s * log(n) * z_i) / sum_j exp(s * log(n) * z_j) (n is the length of the input vector z, i.e. the number of entries being normalized).

Normal softmax (with temperature)

SoftMax_i = exp(z_i / T) / sum_j exp(z_j / T) (T is the temperature).

Here the temperature `T` is a hyperparameter. Having a temperature as a hyperparameter does not seem too different to me from having `s` as a hyperparameter. I personally don't understand the benefits of SSoftMax. During hyperparameter search you would find the optimal `s` just as you would find the optimal temperature `T`.
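
To make the comparison concrete, here is a minimal Python sketch of the two formulas exactly as written above. The function names (`softmax`, `ssoftmax`, `equivalent_temperature`) are my own for illustration and are not from any paper or library. It just checks the algebra: for a fixed `n`, SSoftMax with a given `s` gives the same output as a softmax with temperature T = 1 / (s * log(n)).

```python
import math


def softmax(z, T=1.0):
    """Softmax with temperature: exp(z_i / T) / sum_j exp(z_j / T)."""
    m = max(z)  # subtract the max for numerical stability; the result is unchanged
    exps = [math.exp((x - m) / T) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def ssoftmax(z, s=1.0):
    """Formula from the comment: exp(s*log(n)*z_i) / sum_j exp(s*log(n)*z_j).

    Assumes len(z) > 1, since log(1) = 0 would erase the logits entirely.
    """
    n = len(z)
    scale = s * math.log(n)
    m = max(z)
    exps = [math.exp(scale * (x - m)) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def equivalent_temperature(n, s):
    """Temperature T at which softmax(z, T) matches ssoftmax(z, s) for length n."""
    return 1.0 / (s * math.log(n))


z = [0.5, -1.0, 2.0, 0.0]  # arbitrary example logits
s = 0.4                    # arbitrary example value of s

a = ssoftmax(z, s)
b = softmax(z, T=equivalent_temperature(len(z), s))
print(all(math.isclose(x, y, rel_tol=1e-9) for x, y in zip(a, b)))
```

In this sketch the matching `T` is computed from `len(z)`, so the same `s` maps to a different temperature when the input length changes.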