by radq 1098 days ago
Yeah, that's pretty close. It might be more precise to say they trained one big model that includes 8 "expert networks" and a mechanism to route between them, since everything is trained together.
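
To make the "expert networks plus a router" idea concrete, here is a rough toy sketch of a single mixture-of-experts layer's forward pass. It is not how any particular production model is implemented. The class name `ToyMoELayer`, the top-2 routing choice, the tiny hidden size, and the random weights are all assumptions made for illustration, and each "expert" is shrunk to a single linear map where a real one would be a feed-forward block. No training happens here; in a real model the router and experts sit in the same network and get updated by the same gradient steps.

```python
import math
import random

NUM_EXPERTS = 8   # matches the "8 expert networks" mentioned above
TOP_K = 2         # assumption: each token goes to its 2 highest-scoring experts
DIM = 4           # toy hidden size, chosen only for illustration


def matvec(matrix, vec):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Random weights standing in for trained parameters."""
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


class ToyMoELayer:
    """Hypothetical toy layer: one router plus NUM_EXPERTS linear experts."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        # Router: produces one score per expert for a given token vector.
        self.router = random_matrix(NUM_EXPERTS, DIM, rng)
        # Experts: a single linear map each, to keep the sketch short.
        self.experts = [random_matrix(DIM, DIM, rng) for _ in range(NUM_EXPERTS)]

    def route(self, token):
        """Return [(expert_index, weight), ...] for the top-k experts."""
        scores = matvec(self.router, token)
        top = sorted(range(NUM_EXPERTS), key=lambda i: scores[i], reverse=True)[:TOP_K]
        # Renormalize over the selected experts only, so weights sum to 1.
        weights = softmax([scores[i] for i in top])
        return list(zip(top, weights))

    def forward(self, token):
        """Weighted sum of the selected experts' outputs."""
        out = [0.0] * DIM
        for idx, weight in self.route(token):
            expert_out = matvec(self.experts[idx], token)
            out = [o + weight * e for o, e in zip(out, expert_out)]
        return out


layer = ToyMoELayer()
token = [0.3, -1.2, 0.8, 0.1]   # made-up token embedding
chosen = layer.route(token)     # which experts fire, and with what weight
output = layer.forward(token)   # same shape as the input token
```

Only `TOP_K` of the 8 experts run for any given token, which is the point of the design: the parameter count grows with the number of experts while the per-token compute stays close to that of a much smaller model.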

There isn't a lot of public interpretability work on mixture-of-experts transformer models, but I'd suspect the way the experts specialize is going to be pretty alien to us. I would be surprised if we found that one of the expert networks is used for math, another for programming, another for poetry, etc. Based on Anthropic's work on superposition [1], it's more likely we'll see a lot of overlap between the networks, but who really knows?

[1] https://transformer-circuits.pub/2022/toy_model/index.html

Thank you for the explanation. I still have a hard time understanding how transformers work so amazingly well, and the tech is already quite a few steps beyond that idea.
Andrej Karpathy's "zero to hero" series [1] was how I learned the fundamentals of this stuff. It's especially useful because he explains the why and gives intuitive explanations instead of just covering the what and how. I'd recommend it if you haven't checked it out already.

[1] https://www.youtube.com/watch?v=VMj-3S1tku0&list=PLAqhIrjkxb...