Hacker News
by dplavery92 1422 days ago
In the unCLIP/DALL-E 2 paper[0], they train the encoder/decoder with 650M/250M images respectively. The decoder alone has 3.5B parameters, and the priors combined with the encoder/decoder come to roughly 6B parameters. This is large, but small compared to the name-brand "large language models" (GPT-3 et al.).

This means the parameters of the trained model fit in something like 7GB (decoder only, half-precision floats) to 24GB (full model, full-precision). To actually run the model, you will need to store those parameters, as well as the activations at each layer for each image you are running, in (video) memory. To run the full model on device at inference time (rather than r/w to host between each stage of the model) you would probably want an enterprise cloud/data-center GPU like an NVIDIA A100, especially if running batches of more than one image.
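
For anyone who wants to check that arithmetic, here is a rough sketch. It counts weights only (activations and framework overhead come on top) and uses decimal GB; the helper name `weights_gb` is just something made up for illustration.

```python
# Back-of-the-envelope memory for the weights alone, using the
# parameter counts quoted in the comment above.
BYTES_PER_PARAM = {"fp16": 2, "fp32": 4}


def weights_gb(n_params: float, precision: str) -> float:
    """Return weight storage in decimal GB (1e9 bytes). Hypothetical helper."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


decoder_params = 3.5e9  # decoder only
full_params = 6e9       # approximate: priors + encoder + decoder

print(weights_gb(decoder_params, "fp16"))  # 7.0
print(weights_gb(full_params, "fp32"))     # 24.0
```

Anything that has to hold a batch of images in flight will need headroom beyond these numbers.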

The training set size is ~97TB of imagery. I don't think they've shared exactly how long the model trained for, but the original CLIP dataset announcement[1] used some benchmark GPU training tasks that were 16 GPU-days each. If I were to WAG the training time for their commercial DALL-E 2 model, it'd probably be a couple of weeks of training distributed across a couple hundred GPUs. For better insight into what it takes to train (the different stages/components of) a comparable model, you can look through an open-source effort to replicate DALL-E 2.[2]
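
To put that guess in the same units as the CLIP benchmark figure, here is a quick sketch. It reads "a couple of weeks" as 14 days and "a couple hundred GPUs" as 200; both are assumptions for illustration, not reported numbers.

```python
def gpu_days(num_gpus: int, days: float) -> float:
    """Total GPU-days for a run, assuming every GPU is busy the whole time."""
    return num_gpus * days


# Assumed readings of the guess above, not published figures.
guess = gpu_days(num_gpus=200, days=14)
benchmark_task = 16  # GPU-days per benchmark task, as quoted above

print(guess)                   # 2800
print(guess / benchmark_task)  # 175.0
```

So the guess works out to a couple of orders of magnitude more compute than one of those benchmark tasks.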

[0] https://cdn.openai.com/papers/dall-e-2.pdf
[1] https://openai.com/blog/clip/
[2] https://github.com/lucidrains/dalle2-pytorch

2 comments

> This means the parameters of the trained model fit in something like 7GB (decoder only, half-precision floats) to 24GB (full model, full-precision)

> you would probably want an enterprise cloud/data-center GPU like an NVIDIA A100, especially if running batches of more than one image.

That doesn't seem so bad.

looks up price of NVIDIA A100 - $20,000

oh...ok I'll probably just pay for the service then

I know you're half joking here, but there are more consumer-affordable options like the GeForce RTX 3090 Ti ($1600 for 24GB). It may not do CUDA work as fast as the A100, but it'll be able to run the model.

For the half-precision version at 7GB there are a ton more options (the RTX 3060 has 12GB for example at ~$450).

p4d.24xlarge is only $33/hr! And you get 400 GbE so it should be quick to load.

Thanks for the really excellent insight and links.

I do hope that the conversation starts to acknowledge the difference between sunk costs and running costs.

Employees, office leases and equipment are ongoing costs that happen regardless.

Training DALL-E 2: very expensive, but done now. A sunk cost where every dollar coming in makes the whole endeavor more profitable.

Operating the trained model: still expensive, but you can chart out exactly how expensive by factoring in hardware and electricity.

I believe that by not explicitly separating these different columns when discussing expense vs profit, we're making it harder than it needs to be to reason about what it actually costs every time someone clicks Generate.
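
As a rough sketch of what "charting it out" might look like, here is the running-cost column alone, using the $33/hr p4d.24xlarge price mentioned upthread (that instance has 8 A100s, and a rental price already bundles hardware and electricity). The 10 seconds of GPU time per image is a made-up placeholder, not a measured number.

```python
def cost_per_image(hourly_rate: float, gpus_per_instance: int,
                   gpu_seconds_per_image: float) -> float:
    """Running cost in dollars for one generation, ignoring idle time.

    Hypothetical helper: assumes one image occupies one GPU for the
    given number of seconds and that the instance is fully utilized.
    """
    rate_per_gpu_second = hourly_rate / gpus_per_instance / 3600
    return rate_per_gpu_second * gpu_seconds_per_image


# $33/hr and 8 GPUs describe the instance; 10 s/image is an assumption.
print(round(cost_per_image(33.0, 8, 10.0), 4))
```

Training cost never shows up in this function, which is the point: it belongs in a separate column.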