Quantizing it down to 8 bits seems to be one solution. TensorRT-LLM does this (though I think it requires an H100?). exLlama also does this on much more modest hardware.
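For a rough feel of the savings, here's a back-of-the-envelope sketch. It assumes "it" here means the KV cache, and the model shape below is made up for illustration (not any specific model's real config). It also ignores the small overhead of the scale factors that quantized caches usually store.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem, batch=1):
    """Approximate KV cache size: one key and one value vector
    per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch

# Hypothetical model shape, chosen only to illustrate the arithmetic.
cfg = dict(n_layers=60, n_kv_heads=8, head_dim=128, seq_len=200_000)

fp16_bytes = kv_cache_bytes(**cfg, bytes_per_elem=2)  # 16-bit cache
int8_bytes = kv_cache_bytes(**cfg, bytes_per_elem=1)  # 8-bit cache

print(f"fp16:  {fp16_bytes / 2**30:.1f} GiB")
print(f"8-bit: {int8_bytes / 2**30:.1f} GiB")
```

The cache scales linearly with context length, so going from 16 to 8 bits roughly halves it (or doubles the context that fits in the same VRAM).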
Honestly, I'm not sure how context "sharding" works across multiple GPUs atm. Decent, really long-context OSS models like Yi 200K and the YaRN finetunes are very new.