
by minimaxir 1942 days ago

Note that this is just the VAE component used to help train the model and generate images; it will not let you create crazy images from natural language as shown in the blog post (https://openai.com/blog/dall-e/).

More specifically from that link:

> [...] the image is represented using 1024 tokens with a vocabulary size of 8192.

> The images are preprocessed to 256x256 resolution during training. Similar to VQVAE, each image is compressed to a 32x32 grid of discrete latent codes using a discrete VAE that we pretrained using a continuous relaxation.
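To make the numbers in those quotes concrete, here is a rough back-of-the-envelope sketch in Python. The 256x256 resolution, 32x32 grid, and 8192-entry vocabulary come from the quoted text; the function name `token_stats` and the assumption of 8-bit RGB input for the raw-size comparison are mine, not OpenAI's.

```python
# Figures taken from the quoted blog post.
IMAGE_SIDE = 256    # pixels per side after preprocessing
GRID_SIDE = 32      # discrete latent codes per side
VOCAB_SIZE = 8192   # number of distinct latent codes


def token_stats(image_side, grid_side, vocab_size):
    """Derive token count and rough size figures for one image (hypothetical helper)."""
    tokens = grid_side * grid_side                  # one token per grid cell
    downsample = image_side // grid_side            # pixels per grid cell, per side
    bits_per_token = (vocab_size - 1).bit_length()  # bits needed to index the vocabulary
    latent_bits = tokens * bits_per_token
    raw_bits = image_side * image_side * 3 * 8      # ASSUMPTION: 8-bit RGB input
    return {
        "tokens": tokens,
        "downsample": downsample,
        "bits_per_token": bits_per_token,
        "latent_bits": latent_bits,
        "raw_bits": raw_bits,
        "ratio": raw_bits / latent_bits,
    }


stats = token_stats(IMAGE_SIDE, GRID_SIDE, VOCAB_SIZE)
print(stats)
```

So the 32x32 grid is where the 1024 tokens come from, each grid cell covers an 8x8 pixel patch, and each token needs 13 bits to index the 8192-entry vocabulary.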

OpenAI also provides the encoder and decoder models and their weights.

However, with the decoder model, it's now possible to, say, train a text-encoding model to link up to that decoder (training on, say, an annotated image dataset) to get something close to the DALL-E demo OpenAI posted. Or something even better!

1 comment

indiv0 1942 days ago

Yeah, unfortunately OpenAI has only released the weaker ResNets and vision transformers they trained.

Some brilliant folks (Ryan Murdock [@advadnoun], Phil Wang [@lucidrains]) have tried to replicate their results in projects like big-sleep [0], with decent results, but even with this improved VAE we're still a ways from DALL-E-quality results.

If anyone would like to play with the model, check out either the Google Colab [1] (if you wanna run it on Google's cloud) or my site [2] (if you want a simplified UI).

[0]: https://github.com/lucidrains/big-sleep/

[1]: https://colab.research.google.com/drive/1MEWKbm-driRNF8PrU7o...

[2]: https://dank.xyz