by idle_zealot
820 days ago
I'm tentatively a fan of the high-risk portion of this legislation, but am disappointed that the EU seems to be taking a "training on copyright data is a copyright violation" stance. This basically kills open models. Only the biggest of companies will be able to strike licensing deals on the scale necessary to produce a model familiar with modern human culture. Any model trained only on public domain data will have surprising knowledge gaps, like a person who has never read a book or watched a movie, only read reviews.
On reading the text, I'm not convinced that they actually are taking that stance. Copyright of the training data is only mentioned once in the act, as far as I can find, here:
> Any use of copyright protected content requires the authorization of the rightholder concerned unless relevant copyright exceptions and limitations apply. Directive (EU) 2019/790 introduced exceptions and limitations allowing reproductions and extractions of works or other subject matter, for the purposes of text and data mining, under certain conditions.
Initially, "Any use of copyright protected content requires the authorization of the rightholder concerned" sounds like a strong anti-scraping stance, but then the "unless relevant copyright exceptions and limitations apply" clause makes it nothing more than a restatement of how copyright works in general. The question is whether any exceptions/limitations do apply, and the fact that they immediately point to the DSM directive's copyright exception for text and data mining implies they see it as sufficient for machine learning datasets.
The "certain conditions" essentially just means following robots.txt if it's for commercial purposes, which all scrapers I'm aware of already do regardless.
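For anyone curious what "following robots.txt" looks like in practice, here's a rough sketch using Python's standard `urllib.robotparser`. It assumes, as I did above, that robots.txt is the opt-out signal that matters; the bot names, the URLs, and the robots.txt content are all made up for illustration, and a real scraper would fetch the file from the site instead of hardcoding it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real scraper would download this
# from https://<site>/robots.txt rather than embed it.
EXAMPLE_ROBOTS_TXT = """\
User-agent: ExampleMLBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_scrape(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if robots.txt permits this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(EXAMPLE_ROBOTS_TXT)

# The site has opted this (hypothetical) ML crawler out of everything.
print(may_scrape(parser, "ExampleMLBot", "https://example.com/articles/1"))

# Other crawlers are only kept out of /private/.
print(may_scrape(parser, "OtherBot", "https://example.com/articles/1"))
print(may_scrape(parser, "OtherBot", "https://example.com/private/notes"))
```

Whether a check like this is legally enough to satisfy the directive's conditions is a separate question; this only shows the mechanical part.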