by c1ccccc1 2329 days ago
I'm surprised that it's even necessary to modify the dataset to achieve this. From what I've read, large models will often memorize their training data, and it seems like even with smaller models it should be possible to tell whether or not it was trained with some set of images, simply because the loss will be lower.
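For what it's worth, the loss-based check described above could be sketched roughly like this. It assumes you can get per-image losses out of the model and that you have a reference set of images you know were not used in training; `membership_guess`, `reference_losses` and the `quantile` cutoff are all made-up names for illustration, not anything from the paper.

```python
import math

def cross_entropy(probs, label):
    """Per-example loss: negative log-probability of the true label."""
    return -math.log(max(probs[label], 1e-12))

def membership_guess(candidate_losses, reference_losses, quantile=0.05):
    """Flag candidates whose loss is below a low quantile of the losses
    measured on images assumed NOT to be in the training set."""
    ranked = sorted(reference_losses)
    idx = int(quantile * (len(ranked) - 1))
    threshold = ranked[idx]
    return [loss < threshold for loss in candidate_losses]

# Hypothetical numbers: unseen images sit around loss 1.0,
# memorized ones are much closer to zero.
reference = [1.0, 1.2, 0.9, 1.5, 1.1, 1.3, 0.8, 1.4, 1.6, 1.0]
candidates = [0.05, 0.1, 1.2]
flags = membership_guess(candidates, reference)
```

This only gives a yes/no guess per image, and the cutoff is arbitrary, so it is a heuristic rather than a test with any guarantee.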
2 comments

It is already possible to tell whether a particular image was used in training (see e.g. https://arxiv.org/abs/1809.06396 by the same authors), but this new work also provides a p-value, which tells you how much confidence to place in the result.
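To give an idea of what attaching a p-value to this kind of claim can look like, here is a generic permutation test on per-image losses. To be clear, this is not the method from either paper, just a standard statistical sketch; the function name, the loss lists and the number of permutations are assumptions for illustration.

```python
import random

def permutation_p_value(candidate_losses, reference_losses, n_perm=2000, seed=0):
    """One-sided permutation test. Null hypothesis: candidate and reference
    losses come from the same distribution (the candidates were not trained on).
    A small p-value means the candidates have suspiciously low loss."""
    rng = random.Random(seed)
    n = len(candidate_losses)
    pooled = list(candidate_losses) + list(reference_losses)

    def gap(cand, ref):
        # How much lower the candidate mean loss is than the reference mean.
        return sum(ref) / len(ref) - sum(cand) / len(cand)

    observed = gap(candidate_losses, reference_losses)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if gap(pooled[:n], pooled[n:]) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical losses: candidates look memorized, reference images do not.
seen_like = [0.05, 0.1, 0.08, 0.12, 0.07, 0.09]
unseen = [1.0, 1.2, 0.9, 1.5, 1.1, 1.3]
p_low = permutation_p_value(seen_like, unseen)
p_same = permutation_p_value(unseen, unseen)
```

With these toy numbers `p_low` should come out very small and `p_same` large, which is the kind of calibrated statement a bare loss threshold can't give you.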

Also note that proactively watermarking the dataset can be desirable in some cases. For example, many datasets have large overlaps in the base images they use (but sometimes different labels), so it can be useful to know whether a model was trained on "your" version of the dataset.

Training pipelines tend to apply image transformations before feeding the images to the model, which complicates that.