You should check out Hivemind[1]. It is very similar to what you described, except it uses mixture-of-experts (MoE) for "fragmentation". They have a couple of pre-training examples in their repo. Hivemind was also used to build Petals[2], but Petals only supports fine-tuning and inference[3], afaik.