by avisoori1x 827 days ago
A from-scratch implementation of a sparse mixture-of-experts language model in a single PyTorch file. It is inspired by and largely based on Andrej Karpathy's project 'makemore' and borrows a number of reusable components from that implementation. Like makemore, makeMoE is an autoregressive character-level language model, but it uses a sparse mixture-of-experts architecture. I also added expert capacity to this implementation to make it more complete.

    Adding scaled unit gaussian noise to the logits
        noise = torch.randn_like(logits)*F.softplus(noise_logits)
        noisy_logits = logits + noise
Question: if you swapped this Gaussian noise for Gumbel noise, you would get something like Gumbel-softmax, right? Is there a reason not to use it? Isn't it the usual way to implement differentiable discrete selection? I'm asking about its effectiveness because I've had some trouble using it in practice, so I'm curious whether it has downsides compared to other methods. Honestly, just adding Gaussian noise like this seems simpler anyway.
This is a good point. I haven't tried it yet, as I let this project sit for a couple of months and am only now getting back to it. I went with Gaussian noise because it's simpler, but I'm not sure simpler is necessarily better in this case.
Ah ok, I was wondering if there was some theory here that I wasn't aware of but if it's just experimentation no problem ;) good to know in any case!

I find it a bit difficult to find resources describing the properties of various options for this topic of discrete choices and clustering, apart from a few papers & blogs describing the idea.
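
For anyone following along, here is a rough sketch of the comparison discussed above, in plain Python rather than PyTorch so it runs anywhere. It relies on the Gumbel-max trick: adding independent standard Gumbel noise to the logits and taking the argmax draws an index with probability softmax(logits), whereas Gaussian noise perturbs the ranking without matching the softmax exactly. The logits, the helper names, and the fixed unit noise scale are all assumptions for illustration; the original code scales the Gaussian noise with a learned softplus(noise_logits) term instead.

```python
import math
import random

def softmax(logits):
    # Numerically stable softmax over a plain list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_gumbel(rng):
    # Standard Gumbel(0, 1) via inverse CDF: -log(-log(U)), U ~ Uniform(0, 1).
    u = max(rng.random(), 1e-12)  # guard against log(0)
    return -math.log(-math.log(u))

def noisy_top_k(logits, k, noise_fn):
    # Add independent noise to every logit, then keep the k largest indices.
    noisy = [x + noise_fn() for x in logits]
    order = sorted(range(len(logits)), key=lambda i: noisy[i], reverse=True)
    return order[:k]

def top1_frequencies(logits, noise_fn, trials):
    # Empirical distribution of the top-1 expert under the given noise.
    counts = [0] * len(logits)
    for _ in range(trials):
        counts[noisy_top_k(logits, 1, noise_fn)[0]] += 1
    return [c / trials for c in counts]

rng = random.Random(0)
router_logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical router outputs for 4 experts

probs = softmax(router_logits)
gumbel_freq = top1_frequencies(router_logits, lambda: sample_gumbel(rng), 20000)
gauss_freq = top1_frequencies(router_logits, lambda: rng.gauss(0.0, 1.0), 20000)

print("softmax     :", [round(p, 3) for p in probs])
print("gumbel top-1:", [round(f, 3) for f in gumbel_freq])
print("gauss  top-1:", [round(f, 3) for f in gauss_freq])
```

The Gumbel frequencies should track the softmax probabilities closely, while the unit-Gaussian ones generally will not. In PyTorch, the relaxed differentiable version is available as torch.nn.functional.gumbel_softmax; whether it actually routes better than scaled Gaussian noise is an empirical question this sketch does not answer.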

Question: have you seen an improvement in practice after adding the noise? I'm asking because intuition sometimes doesn't hold up.
Quite honestly, not in my experiments. I wanted to run some Bayesian hyperparameter optimization over discretized options like noise/no-noise and n_expert/top_k, but I haven't been able to find the time, or free capacity on one of our GPU clusters. I plan on using perplexity as the metric, since the model is not yet instruction fine-tuned.
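
As a small illustration of the metric mentioned above: perplexity is the exponential of the mean negative log-likelihood per token (here per character). The sketch below is plain Python with a hypothetical helper name and a made-up 65-character vocabulary; it is not taken from the makeMoE code.

```python
import math

def perplexity(token_log_probs):
    # token_log_probs: natural-log probabilities the model assigned
    # to each observed token in the evaluation text.
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# Sanity check: a model that spreads probability uniformly over a
# vocabulary of V symbols has perplexity V.
vocab_size = 65  # hypothetical character vocabulary size
uniform_log_probs = [math.log(1.0 / vocab_size)] * 100
uniform_ppl = perplexity(uniform_log_probs)
print(uniform_ppl)
```

In a PyTorch training loop the same number is usually obtained by exponentiating the mean cross-entropy loss on held-out data, which makes noise/no-noise runs directly comparable as long as they share the same tokenization and evaluation split.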