by bick_nyers
383 days ago
Or merge the bottom 1/8 (or whatever fraction) of experts together and, optionally, do some minimal training with all other weights frozen. You would need to modify the MoE routers slightly to map old -> new expert indices, so the routers themselves don't need retraining.
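A rough sketch of what the merge plus index remap could look like, in NumPy. Everything here is an assumption for illustration: "bottom" is taken to mean least-routed-to, the merge is a usage-weighted average of the weights, and `merge_bottom_experts`, `route`, and `old_to_new` are hypothetical names, not any library's API.

```python
import numpy as np

def merge_bottom_experts(expert_weights, usage, fraction=1 / 8):
    """Merge the least-used experts into one (hypothetical helper).

    expert_weights: (E, d_in, d_out) array, one weight matrix per expert.
    usage: (E,) array of routing counts, used here to rank experts.
    Returns the new weight stack and an old -> new expert index map.
    """
    num_experts = expert_weights.shape[0]
    num_merge = max(2, int(num_experts * fraction))  # need >= 2 to merge
    order = np.argsort(usage, kind="stable")
    merged_ids = np.sort(order[:num_merge])
    kept_ids = np.sort(order[num_merge:])

    # Assumption: usage-weighted average, uniform if nothing was ever routed.
    w = usage[merged_ids].astype(np.float64)
    w = w / w.sum() if w.sum() > 0 else np.full(num_merge, 1.0 / num_merge)
    merged = np.tensordot(w, expert_weights[merged_ids], axes=1)
    merged = merged.astype(expert_weights.dtype)

    # Kept experts come first, the merged expert takes the last slot.
    new_weights = np.concatenate([expert_weights[kept_ids], merged[None]], axis=0)
    old_to_new = np.empty(num_experts, dtype=np.int64)
    old_to_new[kept_ids] = np.arange(len(kept_ids))
    old_to_new[merged_ids] = len(kept_ids)
    return new_weights, old_to_new

def route(router_logits, old_to_new, top_k=2):
    """Pick top-k with the untouched router, then translate the indices."""
    top_old = np.argsort(-router_logits, axis=-1)[..., :top_k]
    return old_to_new[top_old]
```

The router still emits logits over the old expert count, and only the selected indices get translated. If two of the top-k picks both land on the merged expert you'd get a duplicate index, so the gate weights for those would presumably need to be summed.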