by ijk 1384 days ago
You're probably well aware of this, but to highlight why it matters for people who aren't: duplicated training data degrades training and makes memorization (and therefore plagiarism, in the technical sense) more likely. For language models this includes near-duplicates, not just exact copies, and I'd guess the same extends to images.
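
To make "near-duplicates" concrete, here is a rough sketch of one common way to flag them: compare documents by the Jaccard similarity of their word n-gram sets. This is only an illustration, not the method from either paper linked below; the function names, the 3-word shingle size, and the 0.8 threshold are all assumptions picked for the example, and the pairwise loop is quadratic, so it would not scale to a real training set without something like MinHash.

```python
def shingles(text, n=3):
    """Return the set of word n-grams in text (n=3 is an arbitrary choice)."""
    words = text.lower().split()
    if len(words) < n:
        # Very short texts: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of the shingle sets of two texts, in [0, 1]."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0  # two empty texts count as identical
    return len(sa & sb) / len(sa | sb)


def dedupe(docs, threshold=0.8):
    """Keep a document only if it is below the similarity threshold
    against every document kept so far (O(n^2) comparisons)."""
    kept = []
    for doc in docs:
        if all(jaccard(doc, k) < threshold for k in kept):
            kept.append(doc)
    return kept


docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog today",  # near-duplicate
    "an entirely different sentence about training data",
]
unique_docs = dedupe(docs)
```

With these inputs the second document shares 7 of 8 distinct shingles with the first, so it is dropped and `unique_docs` keeps the first and third.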

Quantifying Memorization Across Neural Language Models https://arxiv.org/abs/2202.07646

Deduplicating Training Data Makes Language Models Better https://arxiv.org/abs/2107.06499 https://twitter.com/arankomatsuzaki/status/14154721921003397... https://twitter.com/katherine1ee/status/1415496898241339400