| HN Mirror

	by mirandrom 234 days ago
	I'm not so sure; they say "The crawled content is dominated by HTML pages and contains only a small percentage of other document formats." https://commoncrawl.github.io/cc-crawl-statistics/plots/mime... In any case, all the images were external CloudFront URLs that have not been archived anywhere afaict.


ccgreg 234 days ago

Hi. I'm the CTO at Common Crawl. Nice to meet you. There's a small amount of "bycatch", and you already discovered how to see it. Notice that it went down after I was hired.

mirandrom 229 days ago

Can't argue with those credentials. Thanks for confirming/clarifying!
