Hacker News
by radarsat1 823 days ago

    # Adding scaled unit Gaussian noise to the logits
    noise = torch.randn_like(logits)*F.softplus(noise_logits)
    noisy_logits = logits + noise
Question: if you swapped this Gaussian noise for Gumbel noise, you would get something like Gumbel-softmax, right? I'm curious why it isn't used here, since it's a common way to implement differentiable discrete selection. I've had some trouble getting Gumbel-softmax to work in practice, so I'd like to know how effective it is and whether it has downsides compared to other methods. Honestly, just adding normal noise like this seems simpler anyway.
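For reference, a minimal sketch of the Gumbel variant being asked about. It is written in dependency-free Python rather than torch, and the helper names (`sample_gumbel`, `gumbel_argmax`, `gumbel_softmax`, `empirical_frequencies`) and the temperature `tau` are illustrative assumptions, not anything from the quoted code. It shows the Gumbel-max trick (taking the argmax of logits plus Gumbel(0, 1) noise samples from softmax(logits)) and the relaxed Gumbel-softmax form that replaces the argmax with a temperature-scaled softmax.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_gumbel(rng):
    """Draw one standard Gumbel(0, 1) sample via inverse transform."""
    u = rng.random()
    # Guard against u == 0.0, where log(u) is undefined.
    while u == 0.0:
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_argmax(logits, rng):
    """Gumbel-max trick: argmax of (logits + Gumbel noise)."""
    noisy = [x + sample_gumbel(rng) for x in logits]
    return max(range(len(noisy)), key=noisy.__getitem__)


def gumbel_softmax(logits, tau, rng):
    """Relaxed version: softmax((logits + Gumbel noise) / tau)."""
    noisy = [(x + sample_gumbel(rng)) / tau for x in logits]
    return softmax(noisy)


def empirical_frequencies(logits, n_samples, seed=0):
    """Fraction of times each index wins under the Gumbel-max trick."""
    rng = random.Random(seed)
    counts = [0] * len(logits)
    for _ in range(n_samples):
        counts[gumbel_argmax(logits, rng)] += 1
    return [c / n_samples for c in counts]


if __name__ == "__main__":
    logits = [1.0, 0.0, -1.0]
    print("softmax:  ", softmax(logits))
    print("empirical:", empirical_frequencies(logits, 20000))
    print("relaxed:  ", gumbel_softmax(logits, tau=0.5, rng=random.Random(1)))
```

With enough samples the empirical frequencies should approach the softmax probabilities. Gaussian noise on the logits, as in the quoted snippet, does not have that exact sampling interpretation, which is one concrete difference between the two choices.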

This is a good point. I have yet to try it, as I let this project sit for a couple of months and am only now getting back to it. I went with this because it's simpler, but I'm not sure simpler is necessarily better in this case.
Ah ok, I was wondering if there was some theory here that I wasn't aware of, but if it's just experimentation, no problem ;) Good to know in any case!

It's a bit hard to find resources describing the properties of the various options for discrete choices and clustering, apart from a few papers and blog posts describing the idea.

Question: have you seen an improvement in practice after adding the noise? Asking because intuition sometimes doesn't hold up.
Quite honestly, not in my experiments. I wanted to do some Bayesian hyperparameter optimization over discretized options like noise/no-noise and n_expert/top_k, but haven't been able to find the time, or free capacity on one of our GPU clusters. I plan on using perplexity as the metric, since the model is not yet instruction fine-tuned.
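For anyone comparing configurations this way, here is a rough sketch of the perplexity calculation, assuming you already have the natural-log probability the model assigned to each held-out token. The function name `perplexity` and the input format are illustrative assumptions, not part of the project's code.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    # A model that assigns probability 0.25 to every token is as
    # uncertain as a uniform choice among 4 options.
    print(perplexity([math.log(0.25)] * 10))
```

Lower is better, and two runs (say, noise vs. no-noise) are only comparable if they are scored on the same held-out tokens with the same tokenizer.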