by supermatt 698 days ago
I haven't trained any LLMs, so please accept my comment with all the naivety with which it is given - but in the "examples of MINT multimodal documents" graphic at the top of the README, it feels to me as though the labeling (for the images on the left) couldn't be much worse. Is this normal for these datasets? How are we able to build such powerful models with such poor quality data?
2 comments

Deep learning is robust to massive label noise [1]

Not to say that data quality does not matter, but these noisy sets are still very useful.

[1] https://arxiv.org/abs/1705.10694

Minor correction: Deep learning using gradient descent is incredibly robust to noise. If you know the mathematics, this also makes sense intuitively: gradients of incorrect labels will generally point in random directions, whereas the "truth" points in a specific direction (and I explicitly mean truth in the sense of what is portrayed consistently as fact in the dataset, not the real world truth). So when you accumulate gradients, you will end up with a net effect that moves weights only towards the consistent answers.

Since gradient descent is by far the most popular algorithm, it's easy to conflate these two things. But there are other approaches that don't treat noise so well.
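
To make the gradient-cancellation argument above concrete, here is a minimal sketch, not taken from the linked paper. It assumes a toy setup: binary logistic regression on synthetic Gaussian features, with half of the labels replaced by coin flips. All names (`logistic_gradient`, `noise_rate`, `w_true`, and so on) are made up for illustration.

```python
import numpy as np

def logistic_gradient(w, X, y):
    """Mean gradient of the logistic loss with respect to w."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (p - y) / len(y)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
n, d, noise_rate = 20_000, 10, 0.5  # toy sizes, chosen arbitrarily

# Synthetic data: the "truth" is a fixed linear rule w_true.
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y_clean = (X @ w_true > 0).astype(float)

# Replace roughly half of the labels with coin flips unrelated to the input.
is_noisy = rng.random(n) < noise_rate
coin_flips = rng.integers(0, 2, size=n).astype(float)
y_noisy = np.where(is_noisy, coin_flips, y_clean)

w0 = np.zeros(d)  # gradients are evaluated at an untrained model
g_clean = logistic_gradient(w0, X, y_clean)
g_noisy = logistic_gradient(w0, X, y_noisy)

# Split the noisy dataset's gradient into its two sources.
g_random_only = logistic_gradient(w0, X[is_noisy], y_noisy[is_noisy])
g_consistent_only = logistic_gradient(w0, X[~is_noisy], y_noisy[~is_noisy])

alignment = cosine(g_clean, g_noisy)
norm_ratio = float(np.linalg.norm(g_random_only) / np.linalg.norm(g_consistent_only))

print(f"cosine(clean gradient, noisy gradient) = {alignment:.3f}")
print(f"|random-label gradient| / |consistent-label gradient| = {norm_ratio:.3f}")
```

In this toy case the averaged gradient from the coin-flip labels is much smaller than the one from the consistent labels, so the noisy dataset's gradient points in nearly the same direction as the clean one. This only covers labels that are random with respect to the input; noise that is systematically wrong in one direction would not cancel this way.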

I think the labels could be much, much worse. They could contain straight noise, just completely random text - not even words. They could also contain plausible, factual text which otherwise has no relationship with the image.

I think most commonly image datasets like this consist of images and their captions, with the presumption that the content author had _some_ reason for associating the two. The goal of the model is to learn that association. And with a _lot_ of examples, to learn nuanced representations.

In the third image, for example, we see some kind of text on a material. The caption mentions "Every year he rides for someone we know, touched by cancer". Perhaps the model is fed another example of bicycle races, with similar imagery of racing bibs. Perhaps it's fed another of a race that specifically mentions it's a charity ride to raise money for cancer. Perhaps....

You get the idea. Alone, each example provides only vague connections between the image and the caption. But when you have a ton of data it becomes easier to separate noise from a weak signal.
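
A rough numerical sketch of that last point, under the simplifying assumption that each example carries the same small signal plus independent unit-variance noise. The values of `signal`, `noise_std` and `trials` are arbitrary and have nothing to do with the MINT dataset itself.

```python
import numpy as np

rng = np.random.default_rng(0)
signal, noise_std, trials = 0.05, 1.0, 200  # hypothetical values

def spread_of_mean(n):
    """Std. dev. of the sample mean over repeated draws of n noisy examples."""
    samples = signal + noise_std * rng.normal(size=(trials, n))
    return float(samples.mean(axis=1).std())

spreads = {n: spread_of_mean(n) for n in (100, 10_000)}

for n, spread in spreads.items():
    print(f"n={n:>6}: signal={signal}, spread of estimate={spread:.4f}")
```

With 100 examples the spread of the estimate is larger than the signal itself, so the signal is lost in the noise. With 10,000 examples the spread drops by roughly a factor of ten (it scales as 1/sqrt(n)), and the same weak signal becomes clearly visible.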