In image captioning specifically we are in a dire need of data. To give you an idea, MS COCO is ~100K images. ImageNet Challenge is 1.2 million. The dataset is quite miniscule, which restricts the richness of models we can explore, and forces strong regularization concerns. The place I normally like to be is when my several hundred million parameter nets are underfitting - that's where neural nets really shine - and MS COCO is not that.
Also it's not only the size of the dataset, it's also the size/variety in the label space. ImageNet is quite comprehensive, with many varied labels. MS COCO is quite biased towards a narrow ~hundred classes.
I'd love to see a properly large dataset of images "from the wild", with no restrictions on content (unlike what is done in MS COCO), annotated with sentences. From my experience with adding data to models in these situations I'm quite certain this would work _significantly_ better.
Also it's not only the size of the dataset, it's also the size/variety in the label space. ImageNet is quite comprehensive, with many varied labels. MS COCO is quite biased towards a narrow ~hundred classes.
I'd love to see a properly large dataset of images "from the wild", with no restrictions on content (unlike what is done in MS COCO), annotated with sentences. From my experience with adding data to models in these situations I'm quite certain this would work _significantly_ better.