by mrfinn
1098 days ago
So in layman's terms, does this mean that on top of a big base of knowledge (?) they trained 8 different 220B models, each specialized in a different area, so that in practice it works like an 8-unit "brain"?
PS. Thinking now about how the human brain does something similar, as our brain is split into two parts and each one specializes in different tasks.
|
There isn't a lot of public interpretability work on mixture-of-experts transformer models, but I'd suspect the way they specialize in tasks is going to be pretty alien to us. I would be surprised if we found that one of the expert networks is used for math, another for programming, another for poetry, etc. Going by Anthropic's work on superposition [1], it's more likely we'll see a lot of overlap between the networks, but who really knows?
[1] https://transformer-circuits.pub/2022/toy_model/index.html
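To make the "no expert per subject" intuition a bit more concrete, here is a minimal sketch of how routing is commonly done in a mixture-of-experts layer, assuming a learned linear router with top-2 selection. Nothing here is confirmed about the model being discussed: the function name `top_k_route`, the toy sizes, and the random weights are all made up for illustration. The thing to notice is that the router picks experts per token from that token's hidden state, so nothing in the mechanism ties an expert to a human-readable topic.

```python
import numpy as np

def top_k_route(token_states, router_weights, k=2):
    """Pick k experts per token; return (expert indices, mixing weights)."""
    logits = token_states @ router_weights           # shape: (tokens, experts)
    top_idx = np.argsort(logits, axis=-1)[:, -k:]    # k highest-scoring experts per token
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    # Softmax over only the selected experts, so weights sum to 1 per token.
    shifted = top_logits - top_logits.max(axis=-1, keepdims=True)
    gates = np.exp(shifted)
    gates /= gates.sum(axis=-1, keepdims=True)
    return top_idx, gates

# Hypothetical toy sizes, chosen only so the example runs quickly.
rng = np.random.default_rng(0)
n_tokens, d_model, n_experts = 6, 16, 8
tokens = rng.normal(size=(n_tokens, d_model))    # stand-in for per-token hidden states
router = rng.normal(size=(d_model, n_experts))   # stand-in for a learned router matrix

experts, gates = top_k_route(tokens, router, k=2)
for t in range(n_tokens):
    print(f"token {t}: experts {experts[t].tolist()}, weights {gates[t].round(2).tolist()}")
```

In a real model each layer has its own router, so the same token can go to different experts at every layer, which is one more reason a clean "math expert" or "poetry expert" seems unlikely.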