| HN Mirror



	by crawlcrawler 2226 days ago
	I built a search engine for this and similar purposes. With Crawl Crawler you start by searching the metadata of a Common Crawl ("CC") crawl. Then you define a subset of that data collection by designing a query whose results include your favorite sites. Then you enrich that subset by linking those metadata documents (which come from CC's WAT repo) to full-text extracts or HTML from CC's WET repo or the WWW. Then you set that subset to refresh on a recurring schedule. Voila! You have created a search index that includes your preferred sites. https://crawlcrawler.com
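	The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not Crawl Crawler's actual code: the record layout (`url`, `title`), the helper names `select_subset`, `enrich`, and `search`, and the sample data are all hypothetical, and real WAT/WET files would have to be parsed first rather than supplied as in-memory dicts.

```python
from urllib.parse import urlparse

# Hypothetical stand-ins for parsed WAT metadata and WET text extracts.
metadata = [
    {"url": "https://example.org/a", "title": "A"},
    {"url": "https://example.com/b", "title": "B"},
    {"url": "https://other.net/c", "title": "C"},
]
texts = {
    "https://example.org/a": "Notes on web crawling",
    "https://example.com/b": "Cooking recipes",
}


def select_subset(metadata_records, wanted_hosts):
    """Keep only the metadata records whose URL host is in wanted_hosts."""
    wanted = {host.lower() for host in wanted_hosts}
    return [
        record
        for record in metadata_records
        if (urlparse(record["url"]).hostname or "").lower() in wanted
    ]


def enrich(subset, text_by_url):
    """Attach the extracted text (or None) to each record, matched by URL."""
    return [{**record, "text": text_by_url.get(record["url"])} for record in subset]


def search(index, term):
    """Return URLs of enriched records whose text contains term (case-insensitive)."""
    term = term.lower()
    return [
        doc["url"]
        for doc in index
        if doc["text"] and term in doc["text"].lower()
    ]


subset = select_subset(metadata, ["example.org", "example.com"])
index = enrich(subset, texts)
hits = search(index, "crawling")
```

	Re-running `select_subset` and `enrich` against each new crawl would be one simple way to approximate the recurring refresh step.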

1 comment

This is pretty cool. I always wondered why there wasn't a search interface somewhere for the Common Crawl data.