It is already possible to tell whether a particular image was used in training (see e.g. https://arxiv.org/abs/1809.06396 by the same authors), but this new work also provides a p-value, so you get a measure of statistical confidence in the result.

Also note that proactively watermarking a dataset can be desirable in some cases. For example, many datasets overlap heavily in the base images they use (sometimes with different labels), so it can be useful to know whether a model was trained on "your" version of the dataset.