by zingelshuher 812 days ago
A similar MoE implementation has been on GitHub for a while, since January 2024:

https://github.com/zxaall/moegpt

1 comment

Oh nice. What's new here is the noisy top-k routing and the expert capacity limit. It also seems to build on Andrej Karpathy's nanoGPT. Mine is from January as well. Here's the original blog post: https://huggingface.co/blog/AviSoori1x/makemoe-from-scratch
It was inspired by Mixtral 8x7B, of course. I think the same approach, going from soft to hard MoE, can be used in other domains, such as video or image processing. It would be interesting to take it to the extreme, for example activating only 4 experts out of 100.
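
For anyone who hasn't seen the two ideas mentioned above, here is a rough sketch of noisy top-k routing with an expert capacity limit, in plain Python so it runs without PyTorch. It is not the code from either repo. It assumes one common formulation: Gaussian noise scaled by softplus of a per-expert noise logit, a softmax over the k surviving logits, and dropping any assignment that would push an expert past its capacity. The names `noisy_top_k` and `route_with_capacity`, the random stand-in logits, and the 1.25 capacity factor are all made up for illustration. The example uses the "4 out of 100" setting.

```python
import math
import random


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)


def noisy_top_k(logits, noise_logits, k, rng):
    """Pick k experts for one token and return (indices, gate weights)."""
    # Perturb each router logit with Gaussian noise scaled by softplus(noise_logit).
    noisy = [l + rng.gauss(0.0, 1.0) * softplus(n)
             for l, n in zip(logits, noise_logits)]
    # Keep the k experts with the largest noisy logits.
    top = sorted(range(len(noisy)), key=lambda i: noisy[i], reverse=True)[:k]
    # Softmax over the selected logits only (same result as masking the rest with -inf).
    m = max(noisy[i] for i in top)
    exps = [math.exp(noisy[i] - m) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]


def route_with_capacity(token_logits, token_noise_logits, k, capacity, rng):
    """Route tokens in order; drop assignments to experts that are already full."""
    num_experts = len(token_logits[0])
    load = [0] * num_experts
    assignments = []
    for logits, noise in zip(token_logits, token_noise_logits):
        experts, weights = noisy_top_k(logits, noise, k, rng)
        kept = []
        for e, w in zip(experts, weights):
            if load[e] < capacity:
                load[e] += 1
                kept.append((e, w))
        assignments.append(kept)
    return assignments, load


# Hypothetical setup: 32 tokens, 100 experts, 4 active experts per token.
rng = random.Random(0)
num_tokens, num_experts, k = 32, 100, 4
# Random numbers stand in for the outputs of learned router and noise layers.
logits = [[rng.gauss(0.0, 1.0) for _ in range(num_experts)] for _ in range(num_tokens)]
noise = [[rng.gauss(0.0, 1.0) for _ in range(num_experts)] for _ in range(num_tokens)]
# Capacity factor of 1.25 is an assumption, not a value taken from either project.
capacity = math.ceil(num_tokens * k / num_experts * 1.25)
assignments, load = route_with_capacity(logits, noise, k, capacity, rng)
```

With this many experts the per-expert capacity is tiny, so some assignments get dropped and the surviving gate weights for a token no longer sum to 1. A real implementation would have to decide whether to renormalize them or let the residual connection carry the dropped part.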