| HN Mirror

	by mirandrom 192 days ago
	I went down a rabbit hole and found most of the missing lists on Common Crawl: https://mirandrom.github.io/bourdain-lists/ Unfortunately, AFAICT, the embedded image data were not included in the Common Crawl scrapes, and a few of the image URLs I tried don't seem indexed by Common Crawl. I only just started playing around with these tools so I might've missed something.
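
For anyone who wants to repeat the "is this URL indexed?" check, here is a minimal sketch against the public Common Crawl URL index. The crawl ID in the usage note is only an example, the helper names are made up for illustration, and treating a 404 as "no captures" is an assumption about how the index server responds.

```python
import json
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import urlopen

# Public Common Crawl index server; each crawl has its own "<crawl-id>-index" endpoint.
INDEX_HOST = "https://index.commoncrawl.org"


def build_query_url(crawl_id, url):
    """Build the index query URL for one crawl and one target URL."""
    params = urlencode({"url": url, "output": "json"})
    return f"{INDEX_HOST}/{crawl_id}-index?{params}"


def parse_records(body):
    """The index answers with one JSON object per line; skip blank lines."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def lookup(crawl_id, url):
    """Return the capture records for `url` in one crawl (empty list if none)."""
    try:
        with urlopen(build_query_url(crawl_id, url), timeout=30) as resp:
            return parse_records(resp.read().decode("utf-8"))
    except HTTPError as err:
        # Assumption: the server reports "no captures" as HTTP 404.
        if err.code == 404:
            return []
        raise
```

Something like `lookup("CC-MAIN-2024-10", "example.com/image.jpg")` would then return an empty list for a URL that was never captured in that crawl. Each record that does come back should carry a `mime` field, which is a quick way to see whether a hit is HTML or an image.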

ccgreg 191 days ago

Common Crawl is a text-only crawl.

mirandrom 191 days ago

I'm not so sure; they say "The crawled content is dominated by HTML pages and contains only a small percentage of other document formats." https://commoncrawl.github.io/cc-crawl-statistics/plots/mime...

In any case, all the images were external CloudFront URLs that have not been archived anywhere, AFAICT.

ccgreg 190 days ago

Hi. I'm the CTO at Common Crawl. Nice to meet you. There's a small amount of "bycatch", and you already discovered how to see it. Notice that it went down after I was hired.

mirandrom 185 days ago

Can't argue with those credentials. Thanks for confirming/clarifying!
