| HN Mirror


by Bartweiss 2306 days ago

> How am I confident that this is realistic if you literally say it's generated?

This is a particularly good question since it's recently been shown that even neural nets trained on real data often pick up substantial, predictable dataset biases.

Practically every single-dataset-trained CNN seems to pick up stylistic quirks in the photos or labels it's trained on. The most visible result is that the CNNs perform better on same-dataset test examples than they do in the wild, sometimes vastly better. More startlingly, it's possible to work backwards from this: the training source of a "finished" CNN can be discerned by looking for certain types of error, and adversarial examples can be predictably constructed based on training source.
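To make the same-dataset vs. in-the-wild gap concrete, here's a toy sketch. It is not a CNN: it's a nearest-centroid classifier on made-up 2D points, and everything in it (the `quirk_strength` parameter, the noise levels, the sample counts) is an assumption chosen for illustration. The training set has a "quirk" feature that happens to track the label; data in the wild does not.

```python
import random

def make_samples(n, quirk_strength, rng):
    """Return ((real, quirk), label) pairs.

    `real` is a weak but genuine signal for the label. `quirk` tracks the
    label only when quirk_strength > 0, standing in for a stylistic
    artifact of one particular dataset (hypothetical setup).
    """
    samples = []
    for _ in range(n):
        label = rng.randint(0, 1)
        sign = 1.0 if label == 1 else -1.0
        real = 0.5 * sign + rng.gauss(0.0, 1.0)
        quirk = quirk_strength * sign + rng.gauss(0.0, 1.0)
        samples.append(((real, quirk), label))
    return samples

def fit_centroids(samples):
    """Compute the mean point of each class."""
    sums = {0: [0.0, 0.0], 1: [0.0, 0.0]}
    counts = {0: 0, 1: 0}
    for (real, quirk), label in samples:
        sums[label][0] += real
        sums[label][1] += quirk
        counts[label] += 1
    return {
        label: (sums[label][0] / counts[label], sums[label][1] / counts[label])
        for label in sums
    }

def predict(centroids, point):
    """Pick the class whose centroid is closest to the point."""
    def sq_dist(label):
        cx, cy = centroids[label]
        return (point[0] - cx) ** 2 + (point[1] - cy) ** 2
    return min(centroids, key=sq_dist)

def accuracy(centroids, samples):
    hits = sum(1 for point, label in samples if predict(centroids, point) == label)
    return hits / len(samples)

rng = random.Random(0)  # fixed seed so the run is repeatable
train = make_samples(2000, 2.0, rng)      # dataset with a label-correlated quirk
same_test = make_samples(2000, 2.0, rng)  # held-out split from the same source
wild_test = make_samples(2000, 0.0, rng)  # "in the wild": the quirk carries no signal

centroids = fit_centroids(train)
same_acc = accuracy(centroids, same_test)
wild_acc = accuracy(centroids, wild_test)
print(f"same-dataset accuracy: {same_acc:.3f}")
print(f"in-the-wild accuracy:  {wild_acc:.3f}")
```

The classifier leans on the quirk because it is the strongest signal available in training, so the held-out split from the same source looks much better than the wild data. Real CNN dataset bias is far subtler than this, but the mechanism is the same in spirit.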

Tagged image sets undoubtedly have stronger and harder-to-remove 'fingerprints' than text data like addresses, but I'd be shocked if the problem were nonexistent for text. My first reaction to "synthetic sensitive user data" for ML is to worry about ending up with systematic errors that come from the generation scheme.

1 comment

LiveTheDream 2306 days ago

Cf. "radioactive data", which tags datasets so you can see which downstream models used them for training: https://ai.facebook.com/blog/using-radioactive-data-to-detec...
