by radq
1098 days ago
Yeah, that's pretty close. It might be more precise to say they trained one big model that contains 8 "expert networks" plus a mechanism to route between them, since everything is trained together.

There isn't much public interpretability work on mixture-of-experts transformer models, but I'd suspect the way the experts specialize is going to be pretty alien to us. I would be surprised if we found that one expert network is used for math, another for programming, another for poetry, etc. Going off of Anthropic's work on superposition [1], it's more likely we'll see a lot of overlap between the networks, but who really knows?

[1] https://transformer-circuits.pub/2022/toy_model/index.html
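To make the "one model, 8 experts plus a router" picture concrete, here's a rough forward-pass sketch in plain Python. It's only an illustration under assumptions of my own: `ToyMoE`, the top-2 routing, the purely linear experts, and the random weights are all hypothetical and not taken from any real model. It also skips training entirely. In a real model the router and the experts would get gradients from the same loss, which is what "trained together" means.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def make_matrix(rows, cols, rng):
    """Random weights stand in for learned parameters (assumption)."""
    return [[rng.gauss(0, 0.5) for _ in range(cols)] for _ in range(rows)]


class ToyMoE:
    """Hypothetical mixture-of-experts layer: a router plus N linear experts."""

    def __init__(self, dim, num_experts=8, top_k=2, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k  # top-2 is an assumption, not from the thread
        self.router = make_matrix(num_experts, dim, rng)
        self.experts = [make_matrix(dim, dim, rng) for _ in range(num_experts)]

    def route(self, x):
        """Return (expert_index, weight) pairs for the top-k experts."""
        logits = matvec(self.router, x)
        ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        top = ranked[: self.top_k]
        # Renormalize over the selected experts only.
        weights = softmax([logits[i] for i in top])
        return list(zip(top, weights))

    def forward(self, x):
        """Weighted sum of the selected experts' outputs for one token."""
        out = [0.0] * len(x)
        for idx, w in self.route(x):
            y = matvec(self.experts[idx], x)
            out = [o + w * yi for o, yi in zip(out, y)]
        return out


moe = ToyMoE(dim=4)
token = [1.0, -0.5, 0.25, 2.0]  # made-up token embedding
chosen = moe.route(token)
output = moe.forward(token)
print("experts used:", chosen)
print("output:", output)
```

Note the router picks experts per token from a learned score, not from any notion of "topic", which is part of why I wouldn't expect a clean math/programming/poetry split.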