|
1. Model   Q+I [1]   Q+I+C [1]   ATT 1000   ATT Full
   ACC.    0.2678    0.2939      0.4838     0.4651
Here ATT Full uses all the words in the vocabulary as candidate answers. As you
can see, it performs worse than ATT 1000, i.e. "Most frequent 1000 answers" (0.4651 vs. 0.4838).
Source: Chen, K., Wang, J., Chen, L. C., Gao, H., Xu, W., & Nevatia, R. (2015). ABC-CNN: An Attention Based Convolutional Neural Network for Visual Question Answering. arXiv preprint arXiv:1511.05960.
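For context, restricting the candidate answers to the K most frequent ones (the "ATT 1000" style setup) can be sketched as below. This is only a minimal illustration and not the code from either paper: the function names `build_answer_vocab` and `answer_length_counts`, the lowercasing step, and the toy answer list are all assumptions made for the example.

```python
from collections import Counter


def build_answer_vocab(answers, k=1000):
    """Map the k most frequent answers to class indices (hypothetical helper)."""
    # Normalizing by strip/lowercase is an assumption, not taken from the papers.
    counts = Counter(a.strip().lower() for a in answers)
    return {ans: idx for idx, (ans, _) in enumerate(counts.most_common(k))}


def answer_length_counts(vocab):
    """Count how many answers in the vocabulary have 1, 2, 3, ... words."""
    return Counter(len(ans.split()) for ans in vocab)


# Toy training answers (made up for illustration only).
toy_answers = ["yes"] * 4 + ["no"] * 3 + ["on the table"] * 2 + ["red"]

toy_vocab = build_answer_vocab(toy_answers, k=3)
toy_lengths = answer_length_counts(toy_vocab)
```

A classifier would then apply a softmax over `len(toy_vocab)` classes; answers outside the vocabulary (here `"red"`) cannot be predicted, which is the trade-off made against using the full vocabulary.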
2. (a) "Several early papers about VQA directly adapt the image captioning
models to solve the VQA problem [10][11] by generating the answer using
a recurrent LSTM network conditioned on the CNN output. But these
models’ performance is still limited [10][11]"
(b) "our own implementation of this model is less accurate on [2] than other
baseline models"
The two quotes above are from Xu, Huijuan, and Kate Saenko. "Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering." arXiv preprint arXiv:1511.05234 (2015).
However, I think my wording was sloppy, as I could not find more concrete
evidence in the literature. I will go back through the papers in detail to recall
where I read that RNN-generated answers do not outperform softmax
classification over the top-K answers.

Also, I would like to note that I am not using only one-word answers as
the set of possible answers. It contains a few two-word answers and very few
three- and four-word answers. Here is the distribution:
Key == number of words in the answer; Value == number of answers with that many words
Counter({1: 855, 2: 112, 3: 32, 4: 1}) |