| I don't speak with any authority on this. With word2vec you can do operations like king - man + woman, and the nearest vector will be queen. Similarly, capital cities sit in a consistent relationship to their countries, so if you take Rome - Italy + Spain you will be nearest to Madrid. That permits limited reasoning by analogy, and there are analogy benchmark tasks on which those word models perform reasonably well, effectively answering loads of questions like "Madrid is to Spain as Rome is to ____?".

This model is more complex and seems to use a series of techniques to "disentangle" the parts of the network, so it's better at separating identity (i.e. this is a white triangle) from pose (i.e. it's rotated 20 degrees clockwise). See this paper[0] for definitions. So it can take an input triple, work out what transformation was applied between the first two images, and then apply that transformation, disentangled from the identity of what it was applied to, to a third image. The vector of the third image, with the transformation applied, is then visualised by the decoder network. So the key is not to just combine some vectors, otherwise you'd end up with a mess; the key is to disentangle the identity so that the part of the vector associated with the change, such as rotation, can be transferred.

I think the link between this kind of operation and word2vec is clearer in the paper Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks[1] (submitted for ICLR 2016 it seems[2]), which shows more explicitly the vectors being added and subtracted for images: "smiling woman" - "neutral woman" + "neutral man" = a series of generated images of a smiling man.
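To make the arithmetic concrete, here is a minimal toy sketch of it. Everything in it is an assumption for illustration: the vectors are hand-made (not real word2vec or DCGAN latents), and the `analogy` helper is a hypothetical name. It excludes the input words from the candidates, which is what analogy evaluations commonly do. In the image case the resulting vector would be fed to a decoder/generator rather than looked up in a vocabulary; the lookup here only shows that the subtraction cancels the shared identity.

```python
import math

# Hand-made toy embeddings (NOT real word2vec vectors).
# Dimensions: [royalty, maleness, femaleness, fruit-ness]
words = {
    "king":  [1.0, 1.0, 0.0, 0.0],
    "queen": [1.0, 0.0, 1.0, 0.0],
    "man":   [0.0, 1.0, 0.0, 0.0],
    "woman": [0.0, 0.0, 1.0, 0.0],
    "apple": [0.0, 0.0, 0.0, 1.0],
}

# Hand-made toy "face" latents (NOT real DCGAN z vectors).
# Dimensions: [woman identity, man identity, smile]
faces = {
    "smiling_woman": [1.0, 0.0, 1.0],
    "neutral_woman": [1.0, 0.0, 0.0],
    "neutral_man":   [0.0, 1.0, 0.0],
    "smiling_man":   [0.0, 1.0, 1.0],
    "frowning_man":  [0.0, 1.0, -1.0],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vocab):
    """Return the entry nearest to vocab[a] - vocab[b] + vocab[c].

    The three input entries are excluded from the candidates.
    """
    target = [x - y + z for x, y, z in zip(vocab[a], vocab[b], vocab[c])]
    candidates = [w for w in vocab if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vocab[w], target))

print(analogy("king", "man", "woman", words))                          # queen
print(analogy("smiling_woman", "neutral_woman", "neutral_man", faces))  # smiling_man
```

In the toy face vectors the identity and the smile live in separate dimensions by construction, which is exactly the "disentangled" property; with entangled dimensions the same subtraction would leave identity residue in the result.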
Woman is the identity in the first two, so subtracting effectively removes the identity, and the "smile" is largely what remains in the vector to be transferred onto the neutral man identity as a transformation.

I suppose the more general point is that we are going to see more and more complex use of various kinds of thought vectors[3] and more and more attempts to improve their accuracy. This area is just going to keep exploding, I think, because the practical applications, such as the animations shown in this paper, will keep driving work on it in multiple domains. The current pace of research, which has led to[4], and the results in this paper and in [1] are just mind-boggling.

[0] http://arxiv.org/abs/1210.5474 (sections 1+2 have a good overview, and 4 has related work)
[1] https://github.com/Newmu/dcgan_code
[2] http://arxiv.org/abs/1511.06434 (first submitted on 19th Nov 2015)
[3] http://arxiv.org/abs/1506.06726
[4] http://www.arxiv-sanity.com/ |