by londons_explore
1100 days ago
They probably trained all 8 experts on the same data. The experts may have become good at different topics, but no human divided up the topics. The output isn't just the best of the 8 experts; it is a blend of the opinions of the experts. Another (usually smaller) neural net decides how to blend together the outputs of the networks, probably on a per-token basis: for each token (roughly, each word), the outputs of all the experts are consulted and blended together, and a token is picked (sampled) before moving on to the next one.
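
To make the blending step concrete, here is a rough toy sketch in plain Python. It is not how any particular model does it: the sizes, the random weights, and the names `moe_step`, `experts`, and `gate` are all made up for illustration. It assumes every expert is consulted and the gate's softmax weights are used to mix their outputs; real mixture-of-experts models often consult only the few top-scoring experts per token.

```python
import math
import random

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, x):
    # Matrix-vector product: one output per row of `weights`.
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def random_matrix(rng, rows, cols):
    # Stand-in for trained weights (hypothetical, untrained).
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def moe_step(x, experts, gate, rng):
    # The small gating net scores the experts for this token.
    gate_weights = softmax(linear(gate, x))
    # Every expert produces its own logits over the vocabulary.
    expert_logits = [linear(e, x) for e in experts]
    vocab_size = len(expert_logits[0])
    # Blend: weighted sum of the experts' logits, per vocabulary entry.
    blended = [
        sum(g * logits[v] for g, logits in zip(gate_weights, expert_logits))
        for v in range(vocab_size)
    ]
    probs = softmax(blended)
    # Sample one token from the blended distribution.
    token = rng.choices(range(vocab_size), weights=probs, k=1)[0]
    return token, gate_weights, probs

rng = random.Random(0)
NUM_EXPERTS, HIDDEN, VOCAB = 8, 4, 5  # toy sizes, chosen arbitrarily
experts = [random_matrix(rng, VOCAB, HIDDEN) for _ in range(NUM_EXPERTS)]
gate = random_matrix(rng, NUM_EXPERTS, HIDDEN)
x = [rng.uniform(-1.0, 1.0) for _ in range(HIDDEN)]  # stand-in for the current context

token, gate_weights, probs = moe_step(x, experts, gate, rng)
print("sampled token id:", token)
print("gate weights:", [round(g, 3) for g in gate_weights])
```

Running `moe_step` again with a different `x` gives different gate weights, which is the sense in which the blend is decided per token rather than once per prompt.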