| "If you don't choose to share it, it's not used" I am curious where your confidence that this is true, is coming from? Besides lots of GPU's, training data seems the most valuable asset AI companies have. Sounds like strong incentive to me to secretly use it anyway. Who would really know, if the pipelines are set up in a way, if only very few people are aware of this? And if it comes out "oh gosh, one of our employees made a misstake". And they already admitted to train with pirated content. So maybe they learned their lesson .. maybe not, as they are still making money and want to continue to lead the field. |
1. There are good, ethical people working at these companies. If you were going to train on customer data that you had promised not to train on there would be plenty of potential whistleblowers.
2. The risk involved in training on customer data that you are contractually obliged not to train on is higher than the value you can get from that training data.
3. Every AI lab knows that the second it comes out that they trained on paying customer data saying they wouldn't, those paying customers will leave for their competitors (and sue them int the bargain.)
4. Customer data isn't actually that valuable for training! Great models come from carefully curated training data, not from just pasting in anything you can get your hands on.
Fundamentally I don't think AI labs are stupid, and training on paid customer data that they've agreed not to train on is a stupid thing to do.