
Can confirm. HuggingFace Accelerate's big model feature[1] has some limits, but it does work. I used it to run a 40GB model on a system with just 20GB of free RAM and a 10GB GPU.

All I had to do was prepare the weights in the format Accelerate understands, then load the model with Accelerate. After that, the rest of the model code worked without any changes.
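For anyone who wants to try it, here is a minimal sketch of that flow. It is not the exact setup described above. It assumes the transformers library, plus a checkpoint folder that already holds a config.json and sharded weights with an index file. The folder path, the memory caps, and the helper names build_max_memory and load_big_model are placeholders.

```python
def build_max_memory(gpu_gib: int, cpu_gib: int) -> dict:
    """Per-device memory caps in the form Accelerate's max_memory expects."""
    # Key 0 is the first GPU; "cpu" is system RAM. Values are assumptions.
    return {0: f"{gpu_gib}GiB", "cpu": f"{cpu_gib}GiB"}


def load_big_model(checkpoint_dir: str, gpu_gib: int = 10, cpu_gib: int = 20,
                   offload_dir: str = "offload"):
    """Load a model larger than GPU + RAM, spilling what doesn't fit to disk."""
    # Imported here so the helper above works without the heavy dependencies.
    from accelerate import init_empty_weights, load_checkpoint_and_dispatch
    from transformers import AutoConfig, AutoModelForCausalLM

    # Build the model skeleton without allocating memory for the weights.
    config = AutoConfig.from_pretrained(checkpoint_dir)
    with init_empty_weights():
        model = AutoModelForCausalLM.from_config(config)

    # Fill in the weights shard by shard, placing layers on GPU, then CPU,
    # then the disk offload folder once both memory caps are reached.
    return load_checkpoint_and_dispatch(
        model,
        checkpoint_dir,
        device_map="auto",
        max_memory=build_max_memory(gpu_gib, cpu_gib),
        offload_folder=offload_dir,
    )
```

Calling load_big_model("/path/to/checkpoint") should hand back a model that can be used like any other. Layers that land on disk get read back in on every forward pass, which is where the time goes.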

But it is incredibly slow. A 20-billion-parameter model took about half an hour to respond to a prompt and generate 100 tokens. A 175-billion-parameter model like Facebook's would probably take hours.
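To put rough numbers on that, here is a back-of-the-envelope sketch. The linear scaling with parameter count is an assumption for illustration, not a measurement; real offloaded runs depend heavily on disk and PCIe bandwidth.

```python
def seconds_per_token(total_minutes: float, tokens: int) -> float:
    """Average generation time per token."""
    return total_minutes * 60 / tokens


def scaled_minutes(base_minutes: float, base_params_b: float,
                   target_params_b: float) -> float:
    """Naive estimate: assume run time grows linearly with parameter count."""
    return base_minutes * target_params_b / base_params_b


print(seconds_per_token(30, 100))         # 18.0 seconds per token
print(scaled_minutes(30, 20, 175))        # 262.5 minutes
print(scaled_minutes(30, 20, 175) / 60)   # 4.375 hours
```

Under that assumption, the same 100 tokens on a 175B model would take a bit over four hours.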

1: https://huggingface.co/docs/accelerate/big_modeling