by nmitchko 1259 days ago
You can split the model across devices with the Hugging Face Accelerate library.

Check out the infer_auto_device_map function, which works out a placement of the model's layers for your configuration (multiple GPUs, CPU RAM, NVMe/disk offload), and then call dispatch_model with that device map.
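
To make the idea of a per-layer placement map concrete, here is a toy sketch in plain Python. It is not Accelerate's code: the function plan_device_map, the layer names, the sizes, and the budgets are all made up for illustration. It just fills the fastest device first and spills the remaining layers to slower ones in order.

```python
# Toy illustration only: this is NOT Accelerate's implementation.
# plan_device_map, the layer names, sizes, and budgets are hypothetical.

def plan_device_map(layer_sizes, budgets):
    """Assign each layer, in model order, to the first device with room left.

    layer_sizes: dict of layer name -> size (arbitrary units), in model order
    budgets: dict of device name -> capacity, ordered fastest to slowest
    """
    remaining = dict(budgets)
    devices = list(budgets)
    current = 0  # index of the device currently being filled
    device_map = {}
    for name, size in layer_sizes.items():
        # Move on to slower devices until the layer fits; never go back,
        # so consecutive layers stay together on a device.
        while current < len(devices) and remaining[devices[current]] < size:
            current += 1
        if current == len(devices):
            raise MemoryError(f"no device has room for {name}")
        device = devices[current]
        remaining[device] -= size
        device_map[name] = device
    return device_map


layers = {"embed": 2, "block.0": 4, "block.1": 4, "block.2": 4, "lm_head": 2}
budgets = {"gpu:0": 6, "gpu:1": 5, "cpu": 4, "disk": 100}

device_map = plan_device_map(layers, budgets)
print(device_map)
```

With these made-up numbers the first two layers land on gpu:0, the next on gpu:1, and the rest spill to cpu and then disk. The real library measures actual parameter sizes and takes per-device memory limits, so treat this only as a mental model of what the map contains.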