by zsyllepsis 697 days ago
I think the labels could be much, much worse. They could contain straight noise, just completely random text - not even words. They could also contain plausible, factual text which has no relationship to the image at all.

I think image datasets like this most commonly consist of images and their captions, with the presumption that the content author had _some_ reason for associating the two. The goal of the model is to learn that association. And with a _lot_ of examples, to learn nuanced representations.

In the third image, for example, we see some kind of text on a material. The caption mentions "Every year he rides for someone we know, touched by cancer". Perhaps the model is fed another example of a bicycle race, with similar imagery of racing bibs. Perhaps it's fed another of a race whose caption specifically mentions it's a charity ride to raise money for cancer. Perhaps...

You get the idea. Alone, each example provides only vague connections between the image and the caption. But when you have a ton of data it becomes easier to separate noise from a weak signal.
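To make the noise-versus-weak-signal point concrete, here is a toy simulation. It is only a sketch under made-up assumptions, not how any real captioning model is trained: the function name `estimate_signal`, the 50% base rate, and the 5-point "signal" are all hypothetical numbers picked for illustration. Each simulated caption mentions a concept slightly more often when the image actually shows it, and we estimate that gap from `n_pairs` examples.

```python
import random

def estimate_signal(n_pairs, base_rate=0.50, signal=0.05, seed=0):
    """Estimate how much more often a caption mentions a concept
    when the image shows it, from n_pairs noisy (image, caption) pairs.
    All rates here are hypothetical, chosen only for illustration."""
    rng = random.Random(seed)
    hits = {True: 0, False: 0}    # captions mentioning the concept
    counts = {True: 0, False: 0}  # images with / without the concept
    for _ in range(n_pairs):
        has_concept = rng.random() < 0.5
        # The caption is mostly noise; the image shifts the odds only slightly.
        p_mention = base_rate + (signal if has_concept else 0.0)
        mentions = rng.random() < p_mention
        counts[has_concept] += 1
        hits[has_concept] += mentions
    # max(..., 1) avoids dividing by zero on tiny samples.
    rate_with = hits[True] / max(counts[True], 1)
    rate_without = hits[False] / max(counts[False], 1)
    return rate_with - rate_without

# The true gap is 0.05; small samples scatter widely around it.
for n in (100, 10_000, 200_000):
    print(n, round(estimate_signal(n), 4))
```

With a hundred pairs the estimate can be far from the true gap, sometimes even the wrong sign, depending on the seed. With a few hundred thousand it should settle close to 0.05, which is the same effect at a much smaller scale.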