by erwannmillon 1048 days ago
Btw, I did this in pixel space for simplicity, cool animations, and compute costs. It would be really interesting to do this as a latent diffusion model (LDM), though of course you can't really do the LAB color space thing unless you train an autoencoder specifically for that color space.
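For anyone unfamiliar with why LAB is handy for colorization: the L channel is lightness only (effectively the greyscale image), while a and b carry the color, so a model can be given L and asked to produce a and b. Below is a minimal sketch of the standard sRGB to CIELAB conversion (D65 white point) in plain Python. The function names are made up for illustration, and this is not taken from the project's code.

```python
def srgb_to_linear(c):
    """Undo the sRGB gamma curve for one channel value in [0, 1]."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def _lab_f(t):
    """Piecewise cube-root function from the CIELAB definition."""
    delta = 6 / 29
    return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29


def rgb_to_lab(r, g, b):
    """Convert one sRGB pixel (values in [0, 1]) to (L, a, b), D65 white."""
    r, g, b = (srgb_to_linear(v) for v in (r, g, b))
    # Linear sRGB -> CIE XYZ
    x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b
    # Normalize by the D65 reference white, then apply the LAB formulas
    fx, fy, fz = _lab_f(x / 0.95047), _lab_f(y / 1.0), _lab_f(z / 1.08883)
    return 116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz)


# A grey pixel has (almost exactly) zero a and b: all of its information
# lives in L, which is what a colorization model would be conditioned on.
grey_lab = rgb_to_lab(0.5, 0.5, 0.5)
red_lab = rgb_to_lab(1.0, 0.0, 0.0)
```

In a colorization setup along these lines, the greyscale input would supply L and the model would only need to generate the two color channels.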

I was really interested in how color is represented in latent space and ran some experiments with VQGAN-CLIP. You can actually do a (not great) colorization of an image by encoding it w/ VQGAN and using a prompt like "a colorful image of a woman".

Would be fun to experiment with if anyone wants to try. I'd love to see the results if someone builds it.

2 comments

> I did this in pixel space for simplicity, cool animations, and compute costs

A slight nitpick: wouldn't doing diffusion in latent space be cheaper?

Depends. Given the low res, the 3x64x64 pixel-space image is smaller than the latents you would get from encoding a higher-res image with models like VQGAN or the Stable Diffusion VAE at their native resolutions.
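To put rough numbers on that, here is a quick back-of-the-envelope comparison. It assumes the Stable Diffusion VAE's usual configuration (8x spatial downsampling, 4 latent channels) and a 512x512 input; the helper name is just for illustration.

```python
def num_values(channels, height, width):
    """Number of scalar values in a channels x height x width tensor."""
    return channels * height * width


# Pixel-space image used in this project: 3 x 64 x 64
pixel_size = num_values(3, 64, 64)

# Assumption: Stable Diffusion VAE at 512x512, 8x downsampling, 4 channels
latent_size = num_values(4, 512 // 8, 512 // 8)

print(pixel_size)   # 12288
print(latent_size)  # 16384
```

So under those assumptions the low-res pixel image is actually the smaller tensor, before even counting the cost of running the encoder and decoder.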

It's easier to get a sense of what's going wrong with a pixel space model though. With latent space, there's always the question of how color is represented in latent space / how entangled it is with other structure / semantics.

Starting in pixel space removed a lot of variables from the equation, but latent diffusion is the obvious next step.

Not necessarily if you don’t already have a pretrained autoencoder.
Question: how long did it take to train this model, and what hardware did you use?
Took a lot of failed experiments; the model would keep converging to greyscale / sepia images. I think one of the ways I fixed it was by adding a greyscale encoder to the arch and using its output embedding as additional conditioning. Can't remember if I only added it to the UNet input or injected it at various stages of the UNet down pass.
I think the final training run was only a couple of hours on a Colab V100.
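
For readers curious what the greyscale-encoder conditioning could look like, here is a minimal PyTorch sketch of the simpler variant mentioned above, where the encoder's output is concatenated to the denoiser input. Everything here is an assumption for illustration: the class names, layer sizes, and the tiny conv stack standing in for the UNet are hypothetical, and timestep conditioning is left out for brevity. It is not the actual architecture from the project.

```python
import torch
import torch.nn as nn


class GreyscaleEncoder(nn.Module):
    """Hypothetical encoder: maps a 1-channel greyscale image to a feature map."""

    def __init__(self, emb_channels=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, emb_channels, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(emb_channels, emb_channels, kernel_size=3, padding=1),
        )

    def forward(self, grey):
        return self.net(grey)


class ConditionedDenoiser(nn.Module):
    """Stand-in for the UNet: a small conv stack over [noisy image, grey embedding]."""

    def __init__(self, image_channels=3, emb_channels=16):
        super().__init__()
        self.encoder = GreyscaleEncoder(emb_channels)
        self.body = nn.Sequential(
            nn.Conv2d(image_channels + emb_channels, 32, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(32, image_channels, kernel_size=3, padding=1),
        )

    def forward(self, noisy, grey):
        cond = self.encoder(grey)
        # Input-level conditioning: concatenate along the channel dimension
        return self.body(torch.cat([noisy, cond], dim=1))


model = ConditionedDenoiser()
noisy = torch.randn(2, 3, 64, 64)  # batch of noisy 3x64x64 images
grey = torch.rand(2, 1, 64, 64)    # matching greyscale inputs
pred = model(noisy, grey)          # same shape as the noisy input
```

The other variant mentioned, injecting the embedding at several stages of the down pass, would mean resizing the feature map to each stage's resolution and concatenating or adding it there instead of only at the input.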