by dplavery92
1422 days ago
In the unCLIP/DALL-E 2 paper[0], they train the encoder and decoder on 650M and 250M images, respectively. The decoder alone has 3.5B parameters, and the combined priors with the encoder/decoder are in the neighborhood of 6B parameters. This is large, but small compared to the name-brand "large language models" (GPT-3 et al.).

This means the parameters of the trained model fit in something like 7GB (decoder only, half-precision floats) to 24GB (full model, full-precision floats). To actually run the model, you will need to store those parameters, as well as the activations computed at each layer for each image you are running, in (video) memory. To run the full model on device at inference time (rather than reading and writing to host memory between each stage of the model), you would probably want an enterprise cloud/data-center GPU like an NVIDIA A100, especially if running batches of more than one image.

The training set size is ~97TB of imagery. I don't think they've shared exactly how long the model trained for, but the original CLIP dataset announcement[1] used some benchmark GPU training tasks that were 16 GPU-days each. If I were to WAG the training time for their commercial DALL-E 2 model, it'd probably be a couple of weeks of training distributed across a couple hundred GPUs. For better insight into what it takes to train (the different stages/components of) a comparable model, you can look through an open-source effort to replicate DALL-E 2.[2]

[0] https://cdn.openai.com/papers/dall-e-2.pdf
[1] https://openai.com/blog/clip/
[2] https://github.com/lucidrains/dalle2-pytorch
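
For anyone who wants to check the 7GB and 24GB figures above, here is a minimal sketch of the arithmetic: parameter count times bytes per parameter. The function name `weight_memory_gb` and the 1 GB = 1e9 bytes convention are my own assumptions for illustration, and the result covers weights only, not the activations, which depend on batch size and resolution.

```python
# Back-of-the-envelope weight memory: parameters x bytes per parameter.
# Assumption: 1 GB = 1e9 bytes; activations are NOT included.
BYTES_PER_PARAM = {"fp16": 2, "fp32": 4}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Return the storage needed for the weights alone, in GB."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


decoder_params = 3.5e9  # decoder alone, as quoted in the comment
full_params = 6e9       # rough full-stack count, as quoted in the comment

print(weight_memory_gb(decoder_params, "fp16"))  # 7.0
print(weight_memory_gb(full_params, "fp32"))     # 24.0
```

Real usage at inference time will be higher than these numbers once activations and framework overhead are added, which is why a 40GB or 80GB card is the comfortable choice for the full model.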
> you would probably want an enterprise cloud/data-center GPU like an NVIDIA A100, especially if running batches of more than one image.
That doesn't seem so bad.
looks up price of NVIDIA A100 - $20,000
oh...ok I'll probably just pay for the service then