by paxys
639 days ago

> Common Crawl runs once and exposes the data in industry standard formats like WARC for other consumers

And what stops companies from using this data for model training? Even if you want your content to be available for search indexing and archiving, AI crawlers aren't going to respect your wishes. Hence the need for restrictive gatekeeping.

Common Crawl doesn't bypass regular copyright law requirements; it just lowers the burden on websites by centralizing the scraping work.