by brucethemoose2
1005 days ago
AFAIK you cannot train a 70B model on 2x 3090, even with GPTQ/QLoRA, and inference is pretty inefficient. Pooling the hardware would achieve much better GPU utilization and (theoretically) faster responses for the host's requests.
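For a rough sense of the numbers, here is a back-of-the-envelope sketch of the weight memory alone. It assumes 24 GiB per 3090 and exactly 70e9 parameters; the function names are made up for illustration. It ignores activations, adapter weights, optimizer state, the KV cache, and quantization metadata, so the real headroom is smaller than what it reports.

```python
GIB = 1024 ** 3  # bytes per GiB


def weight_gib(n_params: float, bits_per_param: float) -> float:
    """Memory needed for the model weights alone, in GiB."""
    return n_params * bits_per_param / 8 / GIB


def headroom_gib(n_params: float, bits_per_param: float,
                 n_gpus: int, vram_gib_per_gpu: float) -> float:
    """Pooled VRAM left after loading the weights (negative = does not fit).

    Everything else a training run needs has to fit in this remainder.
    """
    return n_gpus * vram_gib_per_gpu - weight_gib(n_params, bits_per_param)


if __name__ == "__main__":
    n_params = 70e9  # assumed parameter count for a "70B" model
    for label, bits in [("fp16", 16), ("4-bit", 4)]:
        w = weight_gib(n_params, bits)
        h = headroom_gib(n_params, bits, n_gpus=2, vram_gib_per_gpu=24)
        print(f"{label}: weights {w:.1f} GiB, headroom on 2x 24 GiB: {h:.1f} GiB")
```

Under these assumptions the 4-bit weights fit in the pooled 48 GiB, but whatever is left over has to cover everything a training step needs, which is where it gets tight.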
Here I'm assuming that Petals uses a large number of small, heterogeneous nodes like consumer GPUs. It might as well be something much simpler.