by qayxc 1810 days ago
If it's large sections, that can be fixed by either licence attribution or result filtering.

That's at best a technical issue. What way too many people claim, however, is that the machine isn't even allowed to look at GPL'ed code for some reason, while humans are.

I'd like to learn the reasoning behind that.

2 comments

> What way too many people claim, however, is that the machine isn't even allowed to look at GPL'ed code for some reason, while humans are.

Why would those be the same thing? It's a matter of scale. Just like how people are allowed to read websites, but scraping is often disallowed.

> Just like how people are allowed to read websites, but scraping is often disallowed.

Hosting code on GitHub explicitly allows this type of usage (scraping) according to their TOS, so I have to ask again: why the sudden complaints?

Are we still talking about a shortcoming of the ML model, which very occasionally spits out a few lines of copied code, or should we include search engines in this, because they do the exact same thing by design?

robots.txt, for example, is non-binding and purely advisory as well, and Common Crawl [0] (also used for training GPT-3) publishes a dataset that by definition contains GPL'ed code, no matter where it's hosted. So is that off-limits now, too?

[0] http://commoncrawl.org
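
To make the "purely advisory" point concrete, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The robots.txt rules, the crawler name `example-bot` and the URLs are made up for illustration. The only point is that the check runs inside the crawler itself, so nothing but the crawler's own good manners enforces it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not taken from any real site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def polite_can_fetch(url: str, agent: str = "example-bot") -> bool:
    """Return True if a crawler that chooses to honor robots.txt may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


# A well-behaved crawler asks first; a crawler that never calls this
# function is not stopped by anything in the protocol.
allowed = polite_can_fetch("https://example.com/public/page.html")
blocked = polite_can_fetch("https://example.com/private/secret.html")
```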

I think result filtering (based on the license of the search results) is gnarly enough, and likely computationally intensive enough, to break the whole feature. But it would be interesting to see if it can be crafted to fix the shortcomings of the ML model.
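
For what it's worth, a toy version of license-based result filtering could look like the Python sketch below. Everything in it is an assumption for illustration: the function names, the whitespace tokenizer, the shingle length `k = 6`, the two-entry corpus and the blocked-license set are all hypothetical, and this is not how any real product is known to do it. The idea is just to index runs of `k` tokens from license-tagged code and suppress a suggestion that shares a run with blocked code.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple

Shingle = Tuple[str, ...]


def shingles(code: str, k: int = 6) -> Set[Shingle]:
    """All runs of k consecutive whitespace-separated tokens in `code`."""
    tokens = code.split()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def build_index(corpus: Iterable[Tuple[str, str]], k: int = 6) -> Dict[Shingle, Set[str]]:
    """Map each shingle to the licenses of the snippets it appears in."""
    index: Dict[Shingle, Set[str]] = {}
    for license_id, code in corpus:
        for sh in shingles(code, k):
            index.setdefault(sh, set()).add(license_id)
    return index


def allow(suggestion: str, index: Dict[Shingle, Set[str]],
          blocked: FrozenSet[str] = frozenset({"GPL-3.0"}), k: int = 6) -> bool:
    """Return False if the suggestion shares any k-token run with blocked code."""
    for sh in shingles(suggestion, k):
        if index.get(sh, set()) & blocked:
            return False
    return True


# Hypothetical, hand-written corpus of (license, code) pairs.
CORPUS = [
    ("GPL-3.0", "int total = 0 ; for ( int i = 0 ; i < n ; i ++ ) total += w [ i ] * v [ i ] ;"),
    ("MIT", "def add ( a , b ) : return a + b"),
]
INDEX = build_index(CORPUS)

copied = "for ( int i = 0 ; i < n ; i ++ ) total += w [ i ] ;"
novel = "return sum ( a * b for a , b in zip ( w , v ) )"
copied_ok = allow(copied, INDEX)
novel_ok = allow(novel, INDEX)
```

Each lookup here is a cheap hash probe, but an index over a real corpus would be very large, and exact token matching is easy to defeat by renaming variables, which is roughly where the "gnarly" part comes in.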