by jrochkind1 616 days ago

Common Crawl tries to archive the web and share it openly so that not everyone has to scrape it themselves.

"Our goal is to democratize the data so that everyone, not just big companies, can do high-quality research and analysis."

Because they share it openly, including with those doing AI, they wind up on "AI crawler" lists. Those lists are increasingly used by blocking tools that just "use the AI list", whether by people who don't like AI or, quite ironically, by people trying to prevent the excess traffic that poorly mannered AI crawlers cause. (Common Crawl's crawler is well mannered: it uses a good user-agent, respects robots.txt including crawl-delay, etc.)
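For anyone wondering what "respects robots.txt including crawl-delay" looks like in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the example.com URLs, and the `crawl_policy` helper are all made up for illustration; `CCBot` is the user-agent token Common Crawl's crawler identifies itself with. This is not Common Crawl's actual code, just the kind of check a polite crawler would make before each fetch.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, invented for this example (not from a real site).
# It slows CCBot down and keeps it out of /private/, and allows everyone else.
ROBOTS_TXT = """\
User-agent: CCBot
Crawl-delay: 5
Disallow: /private/

User-agent: *
Disallow:
"""


def crawl_policy(robots_txt: str, user_agent: str, url: str):
    """Return (allowed, crawl_delay_seconds) for a user agent and URL.

    crawl_delay_seconds is None when no Crawl-delay applies to the agent.
    """
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url), parser.crawl_delay(user_agent)


if __name__ == "__main__":
    for agent, url in [
        ("CCBot", "https://example.com/blog/post"),
        ("CCBot", "https://example.com/private/report.html"),
        ("SomeOtherBot", "https://example.com/private/report.html"),
    ]:
        allowed, delay = crawl_policy(ROBOTS_TXT, agent, url)
        print(f"{agent} -> {url}: allowed={allowed}, crawl_delay={delay}")
```

A well-mannered crawler would skip any URL where `allowed` is false and wait at least `crawl_delay` seconds between requests to the same host. The point for site owners is that a crawler behaving this way can be slowed down or excluded with a couple of lines in robots.txt, without reaching for a blanket "AI list" block.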

https://commoncrawl.org/