by Udik 3660 days ago
There is something that escapes me regarding this very cool neural style transfer technique. One would expect it to need at least three starting images: the one to transform, the one used as a source for the style, and a non-styled version of the style source. This last one should give the network hints on how to transform the unstyled version into the styled one. For example, what does a straight line end up being in the style? Or how is a colour gradient represented? Without it, it seems the neural network would have to recognize objects in the styled picture and derive the transformation applied from prior knowledge of what they normally look like. But of course the NN is not advanced enough to do that. Can someone explain to me roughly how this works?
4 comments

Disclaimer: I'm probably wrong about this; this is just how I believe "neural style transfer" works. I have never tried it out, and there are probably a lot of problems with my explanation.

I believe that this is done using Restricted Boltzmann Machines[1] trained on the stylized image.

Think of it as a network that receives an image at the input layer, sends it through one or more hidden layers with fewer nodes (like an auto-encoder), and then tries to reconstruct the image at the output nodes. This is like a lossy compressor-decompressor overfitted to the stylized image.

Now, just pass the real image as an input to your network and the output should be a stylized version of the input.
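
To make the idea in this comment concrete, here is a minimal sketch of such a bottleneck. It is only an illustration of the comment's description, not the published style transfer method. It assumes a purely linear encoder/decoder (fitted in closed form with an SVD instead of being trained), grayscale images, and small non-overlapping patches; the helper names `to_patches`, `from_patches`, `fit_bottleneck` and `apply_bottleneck` are made up for this example.

```python
import numpy as np

def to_patches(img, p):
    """Cut a 2-D grayscale image into non-overlapping p x p patches, one per row."""
    h, w = img.shape
    img = img[: h - h % p, : w - w % p]          # drop ragged edges
    rows, cols = img.shape[0] // p, img.shape[1] // p
    return img.reshape(rows, p, cols, p).swapaxes(1, 2).reshape(rows * cols, p * p)

def from_patches(patches, shape, p):
    """Inverse of to_patches for an image whose sides are multiples of p."""
    rows, cols = shape[0] // p, shape[1] // p
    return patches.reshape(rows, cols, p, p).swapaxes(1, 2).reshape(rows * p, cols * p)

def fit_bottleneck(style_img, p=4, hidden=4):
    """Fit a linear 'auto-encoder' with `hidden` units to the style image only."""
    x = to_patches(style_img, p)
    mean = x.mean(axis=0)
    # The top right-singular vectors give the best linear encoder/decoder pair.
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:hidden]

def apply_bottleneck(img, mean, basis, p=4):
    """Push another image through the bottleneck fitted on the style image."""
    x = to_patches(img, p)
    code = (x - mean) @ basis.T      # encode: fewer numbers than pixels
    recon = code @ basis + mean      # decode: rebuilt from style-derived basis
    h, w = img.shape
    return from_patches(recon, (h - h % p, w - w % p), p)

rng = np.random.default_rng(0)
style = rng.random((16, 16))         # stand-in for the stylized picture
photo = rng.random((16, 16))         # stand-in for the real image
mean, basis = fit_bottleneck(style)
result = apply_bottleneck(photo, mean, basis)
```

The output can only be built from patterns the bottleneck learned from the style picture, which is the intuition the comment is going for.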

[1]http://deeplearning4j.org/restrictedboltzmannmachine

You're right that it needs additional information to distinguish style from content, but it gets that from selected layers of established, pre-trained neural nets for image recognition. I don't entirely understand why it works myself, but it seems to.
From the way you described it, you could consider the pre-trained network to be the "missing image". It already has an idea of what images should look like, so when it detects an object, the "style" is what makes that object different from the stereotypical one it has already modeled.
Right. But it's more complicated than that: the choice of layer(s) to use matters a lot, and I have no idea why they choose the ones they do. It seems to take a bit of dark magic to get it to work well, and a lot of aesthetic judgement too.
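
For reference, in the original paper (Gatys et al.) the split works roughly like this: content is compared through the raw activations of a layer, while style is compared through correlations between that layer's channels (a Gram matrix), which throws away where things are in the picture. Below is a small sketch of that difference. The random array is only a stand-in for real activations from a pre-trained network, and the function names are made up for the example.

```python
import numpy as np

def gram_matrix(features):
    """features: (channels, height, width) -> (channels, channels) correlations."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    return flat @ flat.T / (h * w)

def style_distance(a, b):
    """Compare two feature maps by their channel correlations only."""
    return float(np.mean((gram_matrix(a) - gram_matrix(b)) ** 2))

def content_distance(a, b):
    """Compare two feature maps position by position."""
    return float(np.mean((a - b) ** 2))

rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 6, 6))      # stand-in for one layer's activations

# Shuffle the spatial positions identically in every channel.
perm = rng.permutation(36)
shuffled = feats.reshape(8, 36)[:, perm].reshape(8, 6, 6)

style_gap = style_distance(feats, shuffled)      # zero up to rounding error
content_gap = content_distance(feats, shuffled)  # clearly non-zero
```

Scrambling the layout leaves the "style" measurement unchanged but changes the "content" one, which is why no unstyled reference image is needed.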

I think Alex J. Champandard's implementation is probably the best one out there right now. It has a ton of knobs to twist and is very fast.

I'm learning about ANNs at the moment, so I'm not really getting this but I'd like to.

Aside from the 'missing image' part, what is the fitness example here for training? How does the training process determine what a good image is, given that there aren't many (any?) examples of a source image -> Picasso mapping?

You need a neural network that can understand the content and the style separately. It needs to interpret the content out of the frame and then make a new frame with that content but in a new style.
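
On the training question: as far as the original paper goes, no source -> Picasso pairs are needed, because the network itself is not trained for this. The pre-trained network stays fixed and the pixels of the output image are what gets optimized, against a loss that combines "features close to the content image" with "channel correlations close to the style image". Here is a toy sketch of that loop. The "network" is a hypothetical fixed random linear filter bank `W` rather than a real pre-trained model, the gradient is written out by hand, and a simple step-halving rule stands in for a real optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 3)) * 0.5   # hypothetical fixed "network", never trained here

def features(x):
    """x: (3, n_pixels) image -> (8, n_pixels) feature map."""
    return W @ x

def gram(f):
    """Channel correlations, independent of pixel order and count."""
    return f @ f.T / f.shape[1]

def loss_and_grad(x, content_f, style_g, style_weight=1.0):
    f = features(x)
    c, n = f.shape
    diff_c = f - content_f            # content term: match features directly
    diff_s = gram(f) - style_g        # style term: match Gram matrices
    loss = np.mean(diff_c ** 2) + style_weight * np.mean(diff_s ** 2)
    grad_f = 2 * diff_c / (c * n) + style_weight * 4 * (diff_s @ f) / (n * c * c)
    return float(loss), W.T @ grad_f  # gradient with respect to the pixels

def stylize(content_img, style_img, steps=200):
    content_f = features(content_img)
    style_g = gram(features(style_img))
    x = content_img.copy()            # start from the content image
    loss, grad = loss_and_grad(x, content_f, style_g)
    history = [loss]
    for _ in range(steps):
        step = 1.0
        while step > 1e-12:           # halve the step until the loss goes down
            new_loss, new_grad = loss_and_grad(x - step * grad, content_f, style_g)
            if new_loss < loss:
                x, loss, grad = x - step * grad, new_loss, new_grad
                break
            step /= 2
        history.append(loss)
    return x, history

content = rng.random((3, 64))         # stand-in for the photo
style = rng.random((3, 100)) * 2      # stand-in for the painting (different size is fine)
output, history = stylize(content, style)
```

The only things that change during the loop are the pixels in `x`; a "good image" is simply one with a low combined loss.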