|
Code LLM tools are more defensible when they work only by training on documentation and public domain code of the languages and frameworks. But they tend to have also been trained on code under copyright, without license to do so. Some code LLM tool vendors have had to put in 'safeguards', to try to avoid further embarrassing evidence of outright copying. That doesn't mean that they're not still often passing through obfuscated copying. In real life, when someone is caught plagiarizing, such as copying a paragraph from a published work, and changing some words to fit, it's a career-ending scandal. Other times, it's a grayer area: a mechanical mashing up of multiple works, with the intent of "take these multiple copyrighted works, and mechanically combine them to my needs, in a way that's hard to explain to a judge, and forget about all the copyrights, so I can claim copyright". (Hey, if merely saying "it's an app" can smokescreen an illegal taxi service, hotel service, or rental price-fixing, just think how effective a shield even matrix multiplication is.) In software, copying code without license, or being tainted by exposure to code when you're supposed to "cleanroom" it, are both already considered illegal or shady. A lot of the current enthusiasm around generative AI feels a bit like the popularity of media piracy -- many people know, or have a nagging suspicion, that it's against norms or laws, but it's just so appealing, and everyone around them is doing it. It also has the dynamic of the many exploiting the relatively few creators, in a way that wasn't part of the social contract or laws, and which the creators don't want, but the many can simply take. Especially when the many are armed and cheered on by tool vendors, many of whom should know exactly what they're doing, but who want to win big from this nonconsensual exploitation of the works of others. 
And, as the former head of Google recently advised at Stanford, just take it, get big money, and pay lawyers later. |