| Even though this is called "question answering", it is trained as a classification model, so the model can only pick one of the top 1000 answers it saw during training. This certainly limits the range of possible answers and sometimes produces very weird ones. It is not for lack of trying that all the top papers in visual question answering end up treating this as a classification task: results are really poor when the answer is generated with an RNN, and extending the output beyond the top 1000 answers does not yield better results either. 87% of the questions in training + validation are covered by 1000 unique answers. The latest models have started using more complex forms of memory and integrating the question vectors more tightly. One of the top models, DPPNet, trains a separate matrix from the question vector (a chain of GRUs) to find a correspondence with the image filter weights. Their idea is that, for a given question, some areas of the image features are more relevant than others. Yet another model, DMN+ by MetaMind, uses a dynamic memory network that they built for text question answering, but the extension to images works pretty well. Surprisingly, the models that use visual attention are not the best, and I think that is mostly because this kind of model requires even more data and longer training. Just taking 10 different crops of the question image and voting on the answer beats the attention models (based on the numbers reported in these papers). Right now I am working on adapting "End-To-End Memory Networks" (http://arxiv.org/abs/1503.08895) to this task. I tried working with a Neural Turing Machine but could not make it work for this kind of task, mostly because I lack an in-depth understanding of NTMs. Any feedback from you guys is welcome. P.S. Thanks fchollet for writing Keras and for this post. Can't wait to try Keras 1.0 |
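To make the "top 1000 answers" setup and the crop-voting trick concrete, here is a minimal sketch in plain Python. It is not the code from any of the papers mentioned: the function names (`build_answer_vocab`, `majority_vote`) and the toy data are made up for illustration, and a real pipeline would feed the class indices to a softmax classifier.

```python
from collections import Counter


def build_answer_vocab(answers, top_k=1000):
    """Map the top_k most frequent answers to class indices.

    Returns the answer -> index mapping and the fraction of training
    answers that fall inside the vocabulary (the "coverage").
    """
    counts = Counter(answers)
    vocab = [answer for answer, _ in counts.most_common(top_k)]
    answer_to_index = {answer: i for i, answer in enumerate(vocab)}
    covered = sum(counts[answer] for answer in vocab)
    coverage = covered / len(answers) if answers else 0.0
    return answer_to_index, coverage


def majority_vote(predictions):
    """Return the answer predicted most often across image crops."""
    return Counter(predictions).most_common(1)[0][0]


# Toy data (hypothetical): with top_k=2 only "yes" and "no" become classes.
toy_answers = ["yes", "no", "yes", "2", "red", "yes", "no", "frisbee"]
answer_to_index, coverage = build_answer_vocab(toy_answers, top_k=2)

# Hypothetical per-crop predictions for one question/image pair.
crop_predictions = ["yes", "no", "yes", "yes", "no"]
voted_answer = majority_vote(crop_predictions)
```

Any question whose answer is outside the vocabulary can never be answered correctly, which is where the coverage number (87% for 1000 answers, per the comment above) puts a ceiling on accuracy.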
The classification-based approach is definitely the part I find unsatisfying about this task. The problem to me is that it biases the learned models very strongly towards the data that was collected for training and testing.
Has anyone tried outputting a vector from the model and using cosine similarity to predict the nearest word/phrase/sentence, etc.? This seems to work for non-visual QA.[1] Training is performed using noise contrastive estimation. I've discussed this idea with the Virginia Tech team, but I haven't had time to try it, and they seemed a little skeptical.[2]
[1] https://cs.umd.edu/~miyyer/qblearn/
[2] https://github.com/VT-vision-lab/VQA_LSTM_CNN/issues/14
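For what it's worth, the inference side of the cosine idea could look roughly like the sketch below. This only covers the nearest-neighbour lookup, not the noise contrastive training, and everything in it is an assumption for illustration: the helper names, the 3-dimensional toy embeddings, and the made-up model output vector.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def nearest_answer(output_vec, candidate_embeddings):
    """Return the candidate whose embedding is closest to output_vec."""
    return max(
        candidate_embeddings,
        key=lambda name: cosine(output_vec, candidate_embeddings[name]),
    )


# Toy candidate embeddings (hypothetical); in practice these would come
# from a word or sentence embedding model.
candidates = {
    "red": [1.0, 0.0, 0.0],
    "blue": [0.0, 1.0, 0.0],
    "two": [0.0, 0.0, 1.0],
}

# Pretend this vector is what the VQA model produced for a question.
model_output = [0.9, 0.2, 0.1]
predicted = nearest_answer(model_output, candidates)
```

The appeal is that the candidate set can be swapped or extended at test time without retraining an output layer, which is exactly what the fixed 1000-way softmax cannot do.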