by sidkshatriya
505 days ago
There is a hyperparameter `s` in scalable softmax: SSoftMax_i = exp(s log(n) z_i) / sum_j exp(s log(n) z_j), where n is the length of the input vector z. Normal softmax with temperature is SoftMax_i = exp(z_i / T) / sum_j exp(z_j / T), where the temperature `T` is a hyperparameter. Having a temperature as a hyperparameter does not seem very different to me from having `s` as a hyperparameter. I personally don't understand the benefits of SSoftMax. During hyperparameter search you would find the optimal `s` just as you might find the optimal temperature `T`.
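
To make the comparison concrete, here is a minimal Python sketch of the two formulas as written above. The function names are my own, and subtracting the maximum before exponentiating is just the usual numerical-stability trick, not part of either definition. It assumes n is simply `len(z)`.

```python
import math

def softmax_with_temperature(z, T=1.0):
    """SoftMax_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    m = max(z)  # subtract the max for numerical stability (result is unchanged)
    exps = [math.exp((v - m) / T) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def scalable_softmax(z, s=1.0):
    """SSoftMax_i = exp(s log(n) z_i) / sum_j exp(s log(n) z_j), with n = len(z)."""
    n = len(z)
    scale = s * math.log(n)  # the only difference: the scale depends on n
    scaled = [scale * v for v in z]
    m = max(scaled)  # same stability trick as above
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# For a fixed n (n > 1), SSoftMax matches plain softmax with T = 1 / (s * log(n)).
z = [1.0, 2.0, 0.5, 3.0]
s = 0.7
T_equiv = 1.0 / (s * math.log(len(z)))
a = scalable_softmax(z, s)
b = softmax_with_temperature(z, T_equiv)
```

So for any single fixed n the two are the same function with T = 1 / (s log(n)). The only thing `s` does that a constant `T` does not is make the effective temperature change with n.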