
It depends on whether you can fit the whole model into VRAM. If you can’t, you need some form of GPU parallelism, which means the GPUs have to communicate with each other. But maybe that traffic is small enough that it doesn’t slow inference down much. I’m not sure.
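For a rough feel of the numbers, here is a back-of-the-envelope sketch. Everything in it is an assumption for illustration: the function names are made up, the 20% VRAM overhead for KV cache and activations is a guess, and the traffic estimate assumes a simple layer-wise (pipeline) split where each GPU boundary passes one hidden-state vector per generated token. Tensor-parallel setups communicate differently, and latency can matter more than raw bytes, so this does not settle the question.

```python
import math


def weights_gib(n_params: float, bytes_per_param: float) -> float:
    """Size of the model weights alone, in GiB."""
    return n_params * bytes_per_param / 2**30


def min_gpus(n_params: float, bytes_per_param: float, vram_gib: float,
             overhead_frac: float = 0.2) -> int:
    """Smallest GPU count whose combined usable VRAM holds the weights.

    overhead_frac is an assumed share of each GPU's VRAM reserved for
    KV cache, activations, and runtime overhead (a guess, not a measurement).
    """
    usable_gib = vram_gib * (1 - overhead_frac)
    return math.ceil(weights_gib(n_params, bytes_per_param) / usable_gib)


def pipeline_bytes_per_token(hidden_size: int, bytes_per_elem: int,
                             n_gpus: int) -> int:
    """Bytes crossing GPU boundaries per generated token, assuming a
    layer-wise split that hands one hidden-state vector to the next GPU."""
    return (n_gpus - 1) * hidden_size * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical 70B-parameter model in 16-bit weights on 24 GiB cards.
    gpus = min_gpus(70e9, 2, 24)
    traffic = pipeline_bytes_per_token(hidden_size=8192, bytes_per_elem=2,
                                       n_gpus=gpus)
    print(f"GPUs needed: {gpus}, inter-GPU bytes per token: {traffic}")
```

Under those assumptions the per-token traffic is tiny next to the weights each GPU has to read, which is at least consistent with the guess that the messaging might not be the bottleneck.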