
by qayxc 1812 days ago

> Just like how people are allowed to read websites, but scraping is often disallowed.

Hosting code on GitHub explicitly allows this type of usage (scraping) according to their TOS, so I have to ask again: why the sudden complaints?

Are we still talking about a shortcoming of the ML model, which very occasionally spits out a few lines of copied code, or should we include search engines in this, because they do the exact same thing by design?

robots.txt, for example, is likewise non-binding and purely advisory, and Common Crawl [0] (also used for training GPT-3) publishes a dataset that by definition contains GPL'ed code, no matter where it's hosted. So is that off-limits now, too?

[0] http://commoncrawl.org
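
To make the "advisory" point concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the `ExampleBot` agent name, the example.com URLs and the `polite_crawler_may_fetch` helper are all made up for illustration. The thing to notice is that the check runs entirely on the crawler's side: a crawler that never calls it is not stopped by anything.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def polite_crawler_may_fetch(url, agent="ExampleBot"):
    """Return True if the rules above permit `agent` to fetch `url`.

    Purely a client-side check: the crawler decides whether to ask,
    and whether to honor the answer.
    """
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/public/readme",
                "https://example.com/private/code.py"):
        print(url, polite_crawler_may_fetch(url))
```

A well-behaved crawler would skip the second URL; a less considerate one simply never runs the check.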