Honestly, I'm not sure how context "sharding" works across multiple GPUs atm. Decent, really long-context OSS models like Yi 200K and the YaRN finetunes are very new.