by dustyharddrive 785 days ago
They can't be bothered to allow opt-outs (https://news.ycombinator.com/item?id=39771318) or even attempt to check if code they're archiving is freely licensed (https://www.softwareheritage.org/faq/#25_Is_the_code_checked...)

It's like Common Crawl, another non-commercial mass scraping project just benevolently stealing creations for "AI" companies.

2 comments

Their content policy says you can:

> To request the removal of a content from the Software Heritage archive, you must file a formal request containing all of the following informations:

> (...)

> Please send your request by e-mail to takedown@softwareheritage.org

https://www.softwareheritage.org/legal/content-policy/

The HN thread you posted even shows that they contacted the forge's admin a week in advance to check for possible concerns.

Those are instructions to send a copyright infringement notice with tons of PII. We should have higher standards for "opt-out" than that, even for non-consensual data vacuums.

We like calling Common Crawl a crawl, not a scraper. Our 17-year-old dataset predates the current AI explosion.