Hacker News
by furyofantares 910 days ago
> Synthetic data has many advantages - it is free of copyright issues, the downstream models can't possibly violate copyright if they never saw the copyrighted works to begin with.

I feel like we don't know if this is true or not. If we decide models trained on copyrighted data aren't fair game, it's possible we'll decide "laundered" data also isn't.

I mean, maybe that's not feasible. And I hope we don't decide training on copyrighted material is bogus anyway. But I don't think we know yet.

But also, you can totally violate the copyright on something you never saw.


It's a matter of ensuring the synthetic content is different enough from the referenced content. We can filter.

Sure, but what matters for copyright is output, not input. For now.

If we make the (poor, imo) decision to prevent training on copyrighted data, that's a restriction on the training process, not on its result.

And in the world where we're making bad decisions to put legal restrictions on the training process, "can't train on data obtained by models that were trained without these restrictions" seems on the table.