by iamaaditya 3723 days ago
1.

   Model   Q+I [1]   Q+I+C [1]   ATT 1000   ATT Full
   Acc.    0.2678    0.2939      0.4838     0.4651

Here "ATT Full" means using all the words in the vocabulary as candidate answers. As you can see, it performs worse than "ATT 1000", which uses only the most frequent 1000 answers.

Source:

   Chen, K., Wang, J., Chen, L. C., Gao, H., Xu, W., & Nevatia, R. (2015). ABC-CNN: An Attention Based Convolutional Neural Network for Visual Question Answering. arXiv preprint arXiv:1511.05960.
2.

(a) Several early papers about VQA directly adapt the image captioning models to solve the VQA problem [10][11] by generating the answer using a recurrent LSTM network conditioned on the CNN output. But these models’ performance is still limited [10][11]

(b) our own implementation of this model is less accurate on [2] than other baseline models

The two quotes above are from:

   Xu, Huijuan, and Kate Saenko. "Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering." arXiv preprint arXiv:1511.05234 (2015).
However, I think my wording was sloppy, since I could not find more concrete proof in the literature. I will go back through the papers in detail to recall where I read that an RNN generating answers does not outperform softmax classification over the top-K answers.

I would also like to note that I am not restricting the set of possible answers to one-word answers. It contains a few two-word answers and very few three- and four-word answers.

Here is the distribution (key = number of words in the answer, value = count of answers with that many words):

Counter({1: 855, 2: 112, 3: 32, 4: 1})
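For anyone who wants to reproduce a count like the one above, here is a minimal sketch of how it could be computed. It assumes answers are plain strings split on whitespace; the helper names and the toy answer list are hypothetical and are not my actual pipeline or data.

```python
from collections import Counter


def top_k_answers(answers, k):
    """Return the k most frequent answer strings in the training answers."""
    return [answer for answer, _ in Counter(answers).most_common(k)]


def length_distribution(answer_set):
    """Map number of whitespace-separated words -> count of answers of that length."""
    return Counter(len(answer.split()) for answer in answer_set)


# Hypothetical toy data standing in for the real training answers.
training_answers = (
    ["yes"] * 4
    + ["no"] * 3
    + ["tennis racket"] * 2
    + ["on the table"]
)

# Keep only the most frequent answers as classification targets (k=3 here,
# k=1000 in the setting discussed above), then count words per answer.
candidates = top_k_answers(training_answers, k=3)
distribution = length_distribution(candidates)
print(distribution)
```

With k=1000 on the real answer list, the values of the resulting Counter should sum to 1000, as they do in the distribution above (855 + 112 + 32 + 1).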