Hacker News
by Nouser76 838 days ago
Is there any framework/system that distributes the work across multiple GPUs on different computers over a network (LAN or WAN)? I'm not concerned much about latency or generation time, but would love to train or load up huge models and send jobs to run overnight.
1 comment

Yes, you can run FSDP/QLoRA across multiple nodes. There's a Slurm script in the repo showing how to do it.
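
For anyone wondering what a multi-node launch looks like in general, here is a rough sketch (not the repo's actual Slurm script, which may do this differently). It assumes the training job is started with PyTorch's `torchrun` launcher, once per machine. The helper names, `train.py`, the IP address, and the node/GPU counts are all made up for illustration.

```python
import shlex


def build_torchrun_command(nnodes, gpus_per_node, node_rank, master_addr,
                           master_port=29500, script="train.py", script_args=()):
    """Assemble the torchrun command that one node would run.

    Every node runs the same command except for --node_rank.
    `script` is a hypothetical training entry point, not a file from the repo.
    """
    return [
        "torchrun",
        f"--nnodes={nnodes}",                # total number of machines
        f"--nproc_per_node={gpus_per_node}",  # one worker process per GPU
        f"--node_rank={node_rank}",          # index of this machine, 0-based
        f"--master_addr={master_addr}",      # address of the rank-0 machine
        f"--master_port={master_port}",      # free TCP port on that machine
        script,
        *script_args,
    ]


def global_rank(node_rank, gpus_per_node, local_rank):
    """Global rank of a worker, given its node index and its GPU index on that node."""
    return node_rank * gpus_per_node + local_rank


if __name__ == "__main__":
    # Assumed setup: 2 machines with 4 GPUs each, first machine at 10.0.0.1.
    for node_rank in range(2):
        cmd = build_torchrun_command(2, 4, node_rank, "10.0.0.1")
        print(shlex.join(cmd))
```

A scheduler such as Slurm mostly automates the per-node part: it starts the same command on each machine and supplies the node index and the head node's address, so nothing has to be typed by hand on every box.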