Basically, the training works as follows:
Take a color image in RGB and convert it to LAB. This is an alternative color space in which the first channel (L) is a greyscale image and the other two channels (A and B) carry the color information. In a traditional pixel-space (non-latent) diffusion model, you noise all the RGB channels and train a U-Net to predict the noise at a given timestep. When colorizing an image, the U-Net always "knows" the black-and-white image (i.e. the L channel), so this implementation adds noise only to the color channels and keeps the L channel constant. To train the model, you therefore need a dataset of color images: they are converted to LAB and the color channels are noised. You can't train on decolorized images, because the network has to learn to predict color with a black-and-white image as context, and without color information there is nothing for it to learn from.
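To make the training step above concrete, here is a minimal NumPy sketch of building one training example. It is an illustration under assumptions, not the actual implementation being discussed: the linear beta schedule, the 1000 timesteps, the normalization constants (50 and 110), and the function names `rgb_to_lab` and `make_training_pair` are all hypothetical choices made for this example. The U-Net itself is left out; the sketch only shows what it would receive as input and what it would be asked to predict.

```python
import numpy as np

def rgb_to_lab(rgb):
    """Convert sRGB floats in [0, 1], shape (H, W, 3), to CIELAB (D65 white)."""
    # Undo the sRGB gamma curve to get linear light.
    linear = np.where(rgb <= 0.04045, rgb / 12.92, ((rgb + 0.055) / 1.055) ** 2.4)
    # Linear RGB -> XYZ, then divide by the D65 reference white.
    m = np.array([[0.4124564, 0.3575761, 0.1804375],
                  [0.2126729, 0.7151522, 0.0721750],
                  [0.0193339, 0.1191920, 0.9503041]])
    xyz = linear @ m.T / np.array([0.95047, 1.0, 1.08883])
    delta = 6 / 29
    f = np.where(xyz > delta ** 3, np.cbrt(xyz), xyz / (3 * delta ** 2) + 4 / 29)
    L = 116 * f[..., 1] - 16
    a = 500 * (f[..., 0] - f[..., 1])
    b = 200 * (f[..., 1] - f[..., 2])
    return np.stack([L, a, b], axis=-1)

def make_training_pair(rgb, t, alpha_bar, rng):
    """Return (model_input, target_noise) for timestep t.

    Only the two color channels are noised; the L channel is passed
    through untouched so the network always sees the greyscale image.
    """
    lab = rgb_to_lab(rgb)
    L = lab[..., :1] / 50.0 - 1.0   # assumed scaling: L in [0, 100] -> [-1, 1]
    ab = lab[..., 1:] / 110.0       # assumed scaling: roughly [-1, 1]
    eps = rng.standard_normal(ab.shape)
    # Standard DDPM-style forward process, applied to the color channels only.
    noisy_ab = np.sqrt(alpha_bar[t]) * ab + np.sqrt(1.0 - alpha_bar[t]) * eps
    model_input = np.concatenate([L, noisy_ab], axis=-1)
    return model_input, eps

# Assumed schedule: linear betas over 1000 steps.
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
rgb = rng.random((8, 8, 3))         # stand-in for a real color photo
x_t, target = make_training_pair(rgb, t=500, alpha_bar=alpha_bar, rng=rng)
```

The network would take `x_t` (clean L plus noisy color channels) and the timestep, and be trained to output `target`. This is also why a greyscale-only dataset is useless here: with no color channels there is nothing to noise and nothing to predict.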
An extreme example:
https://www.cabinetmagazine.org/issues/51/archibald.php
https://www.messynessychic.com/2016/05/05/max-factors-clown-...
Colourising old TV footage can only result in a misrepresentation, because the underlying colour was deliberately false in order to have any kind of usable representation on the medium itself.
And this caricatured example underlines the problem with colourisation: contemporary bias is unavoidable and can be misleading. Can you take a black-and-white photo of an African-American woman from the 1930s and accurately colour her skin?
You cannot.