Hacker News
by iamaaditya 3727 days ago
Even though this is "Question Answering", it is trained as a classification model. Thus the model will try to come up with one of the top 1000 answers it has seen during training. This certainly limits the range of possible answers, and the model sometimes returns a very weird answer.

It is not for lack of trying that all the top papers in visual question answering end up doing this as a classification task. Results are really poor when it is used as RNN generation, and extending it beyond the top 1000 answers does not yield any better results either. 87% of the questions in training + validation are covered by 1000 unique answers.
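To make the "top 1000 answers" setup concrete, here is a minimal sketch of how the answer classes and their coverage could be computed. The function name `top_k_coverage` and the toy answers are made up for illustration; this is not the code used for the numbers above.

```python
from collections import Counter

def top_k_coverage(answers, k):
    """Return the k most frequent answers and the fraction of
    questions whose answer falls inside that set."""
    counts = Counter(answers)
    top = [answer for answer, _ in counts.most_common(k)]
    covered = sum(counts[answer] for answer in top)
    return top, covered / len(answers)

# Toy data standing in for the real annotations (hypothetical).
toy_answers = ["yes", "no", "yes", "2", "red", "yes", "no", "frisbee"]
classes, coverage = top_k_coverage(toy_answers, k=2)
print(classes, coverage)
```

With k = 1000 on the real training + validation answers, the coverage figure would be the 87% quoted above; every question outside that set can never be answered correctly by the classifier.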

Latest models have started using more complex forms of memory and integrating the question vectors more tightly. One of the top models, DPPNet, trains a separate matrix from the question vector (a chain of GRUs) to find correspondence with the image filter weights. Their idea is that some questions have more relevant areas in the image features. Yet another model, DMN+ by MetaMind, uses the dynamic memory network they built for language question answering, but the extension to images works pretty well.

Surprisingly, the models that use visual attention are not the best, and I think it is mostly because this kind of model requires even more data and longer training. Just taking 10 different crops of the question image and voting on the answer beats attention models (based on the numbers reported in these papers).
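The crop-and-vote trick is simple enough to sketch. This is only an illustration under assumptions: `predict` stands for any model that maps (crop, question) to an answer string, and `fake_model` below is a stub, not a real network.

```python
from collections import Counter

def vote(predictions):
    """Return the answer predicted most often across the crops."""
    return Counter(predictions).most_common(1)[0][0]

def answer_with_crops(predict, crops, question):
    """Run the model once per crop and take a majority vote."""
    return vote([predict(crop, question) for crop in crops])

# Stub model (hypothetical): looks the answer up per crop name.
fake_outputs = {"crop_a": "red", "crop_b": "blue", "crop_c": "red"}

def fake_model(crop, question):
    return fake_outputs[crop]

print(answer_with_crops(fake_model, ["crop_a", "crop_b", "crop_c"],
                        "What color is the ball?"))
```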

Right now I am working on converting "End to end network" (http://arxiv.org/abs/1503.08895) to this task. I tried working with the Neural Turing Machine, but I could not make it work for this kind of task, mostly because of my lack of in-depth understanding of NTM.

Any feedback from you guys is welcome.

P.S. Thanks fchollet for writing Keras and for this post. Can't wait to try Keras 1.0.

4 comments

Thanks for your work on this, I find the VQA task really interesting.

The classification-based approach is definitely the part I find unsatisfying about this task. The problem to me is that it biases the models learned very strongly towards the data that was collected for training and testing.

Has anyone tried outputting a vector from the model, and using cosine similarity to predict the nearest word/phrase/sentence etc.? This seems to work for non-visual QA.[1] Training is performed using noise contrastive estimation. I've discussed this idea with the Virginia Tech team, but I haven't had time to try it, and they seemed a little skeptical.[2]

[1] https://cs.umd.edu/~miyyer/qblearn/

[2] https://github.com/VT-vision-lab/VQA_LSTM_CNN/issues/14
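For anyone who wants to picture the nearest-neighbour idea, a rough sketch of the prediction step might look like the following. The vectors, the answer names and the helper `nearest_answer` are toy assumptions; the training side (noise contrastive estimation) is not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_answer(output_vec, answer_vecs):
    """Pick the candidate whose embedding is closest to the model output."""
    return max(answer_vecs, key=lambda ans: cosine(output_vec, answer_vecs[ans]))

# Toy 2-d embeddings (hypothetical); real ones would come from a
# word or phrase embedding model.
toy_vecs = {"red": [1.0, 0.0], "blue": [0.0, 1.0], "dark red": [0.9, 0.1]}
print(nearest_answer([0.8, 0.3], toy_vecs))
```

The point of this design is that the candidate set is just a dictionary of embeddings, so it can be extended with new words or phrases without retraining an output layer.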

hi syllogism

Right now everyone doing this is highly focussed on the competition and trying to beat the numbers. For that purpose certainly they would want to stick to predicting Top K answers.

For example, see this table:

  Model   Q+I [1]   Q+I+C [1]   ATT 1000   ATT Full
  ACC.    0.2678    0.2939      0.4838     0.4651

ATT Full is the variant that uses all the words as candidate answers, and it performs worse than ATT 1000. Source: Chen, K., Wang, J., Chen, L. C., Gao, H., Xu, W., & Nevatia, R. (2015). ABC-CNN: An Attention Based Convolutional Neural Network for Visual Question Answering. arXiv preprint arXiv:1511.05960.

Once the competition is over (in roughly two months), there will be more focus on the actual AI part, where generating the answers would be the right thing to do. There are other papers that use an external knowledge base like DBpedia; certainly the "answer word" could be picked up from there.

What you have suggested is a very interesting approach, and I am not aware of any paper which has tried that. Certainly quite a few papers have tried to extend NLP QA to Visual QA, but with limited success (except the MetaMind people). I will certainly keep that in my ideas-to-try list. I will update you if I get some results.

P.S.: Thank you for creating spaCy, I love it and I use it every day!

> It is not for lack of trying that all the top papers in visual question answering end up doing this as a classification task. Results are really poor when it is used as RNN generation

I'd be curious to know if you have a reference for this. Given that the answers are one word, a word-level RNN language model output should basically be the same thing as a straight 1000-way softmax.

1.

   Model   Q+I [1]   Q+I+C [1]   ATT 1000   ATT Full
   ACC.    0.2678    0.2939      0.4838     0.4651

Here ATT Full represents using all the words in the vocabulary; as you can see, it performs worse than "Most frequent 1000 answers" (ATT 1000).

Source:

   Chen, K., Wang, J., Chen, L. C., Gao, H., Xu, W., & Nevatia, R. (2015). ABC-CNN: An Attention Based Convolutional Neural Network for Visual Question Answering. arXiv preprint arXiv:1511.05960.

2.

(a) Several early papers about VQA directly adapt the image captioning models to solve the VQA problem [10][11] by generating the answer using a recurrent LSTM network conditioned on the CNN output. But these models’ performance is still limited [10][11]

(b) our own implementation of this model is less accurate on [2] than other baseline models

The two quotes above are from:

   Xu, Huijuan, and Kate Saenko. "Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering." arXiv preprint arXiv:1511.05234 (2015).

However, I think my wording was sloppy, as I could not find more concrete proof in the literature. I will revisit the papers in detail to recollect where I read that RNN-generated answers do not outperform softmax classification over the top-K distribution of answers.

Also, I would like to note that I am not using only one-word answers as the possible set of answers. It contains a few two-word answers, and very few three- and four-word answers.

Here is the distribution (key = number of words in the answer, value = count of answers with that many words):

Counter({1: 855, 2: 112, 3: 32, 4: 1})
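A distribution like this can be produced in a couple of lines. The sketch below uses a made-up answer list and a hypothetical helper name `length_distribution`; the real input would be the list of 1000 answer classes.

```python
from collections import Counter

def length_distribution(answers):
    """Count answers by their number of whitespace-separated words."""
    return Counter(len(answer.split()) for answer in answers)

# Toy answer classes (hypothetical), not the real top-1000 list.
toy_classes = ["yes", "red", "tennis racket", "on the table"]
print(length_distribution(toy_classes))
```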

Does this actually make sense at all? Usually I figure out the meaning of a sentence to form an answer, which requires thinking about more than just the words in the given sentence.
So, I experimented with Skip-thought vectors (as the question representation) to embody the semantic knowledge about the question, but that performed poorly, I mean very poorly. Tbh, I was surprised by that.

The way the system currently works, it is highly biased towards the individual words in the question: whenever it sees the word "color" in the question, it is going to pick one of the colors as the answer, even if the question is not asking about colors. This is certainly due to all the priors from the training data.

Comparing Obama with monkeys and bananas is a bit unfortunate considering it's a common racist stereotype.