
Ground truth will always be superior on the "does this match the ground truth?" metric, but that metric is often just a proxy for output quality, and the model will be judged differently once deployed (e.g. "do human users like this?").

That's something to be aware of, especially when you're using convenience data of unknown quality to evaluate your model – many research datasets scraped off the internet with little curation and labeled in a rush by low-paid workers contain a lot of SEO garbage and labeling errors.
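
To make that concrete, here's a toy sketch (not from any real dataset). The 100-example binary task, the 10% label-flip rate, and the two models are all made up for illustration. It compares a model that is always right about the real answer with one that simply reproduces the dataset's labels, mistakes included.

```python
def accuracy(predictions, labels):
    """Fraction of positions where the prediction equals the label."""
    matches = sum(p == y for p, y in zip(predictions, labels))
    return matches / len(labels)


# Hypothetical binary task: 100 examples with alternating true classes.
true_labels = [i % 2 for i in range(100)]

# Assumed annotation noise: every 10th label is flipped (10% error rate).
dataset_labels = [1 - y if i % 10 == 0 else y for i, y in enumerate(true_labels)]

# Model A is always right about the real answer.
model_a = list(true_labels)
# Model B reproduces the dataset's labels, errors included.
model_b = list(dataset_labels)

scores = {
    "A vs dataset labels": accuracy(model_a, dataset_labels),
    "B vs dataset labels": accuracy(model_b, dataset_labels),
    "A vs true labels": accuracy(model_a, true_labels),
    "B vs true labels": accuracy(model_b, true_labels),
}
for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions, the benchmark ranks B above A even though A is the better model, and A's measured score can never exceed one minus the label error rate. Real label noise is rarely this tidy, so treat this as an illustration of the direction of the effect rather than its size.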