| HN Mirror

by ricardo81 590 days ago

A shared index would surely be nice (Common Crawl is perhaps an example of one that could be used), but say you had 10 search engines running from it. One decides a page is very important and updates constantly, so it should be fetched every 30 minutes. Another decides the same page is spam and doesn't need to be recrawled. There are backend choices like these that affect the shape and crawl direction of the index.
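To make the conflict concrete, here is a rough sketch of what a shared crawler would have to do with those two opinions. The names (`CrawlPreference`, `shared_recrawl_interval`) and the "most demanding engine wins" rule are my own assumptions for illustration, not how Common Crawl or any real engine does it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CrawlPreference:
    """One engine's opinion about a single URL (hypothetical structure)."""
    engine: str
    # Desired recrawl interval in minutes; None means "don't bother recrawling".
    interval_minutes: Optional[int]


def shared_recrawl_interval(prefs: List[CrawlPreference]) -> Optional[int]:
    """Assumed policy: the most demanding engine wins.

    Returns the shortest requested interval, or None if no engine
    wants the page recrawled at all.
    """
    wanted = [p.interval_minutes for p in prefs if p.interval_minutes is not None]
    return min(wanted) if wanted else None


prefs = [
    CrawlPreference("engine_a", 30),    # thinks the page is important
    CrawlPreference("engine_b", None),  # thinks the page is spam
]
print(shared_recrawl_interval(prefs))  # 30
```

Under this policy the shared crawler pays for the 30 minute fetches even though one engine considers the page spam. Any other rule (majority vote, per-engine budgets) just moves the cost around, which is the point: the crawl schedule is itself a ranking decision.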

Then there are things like whether the crawler should render the page (using the final DOM content rather than the original source), whether it does any tokenisation of the content or stores other metrics, or whether all of that needs to be done by the end search engines.
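As a sketch of how those choices end up in the stored record, assume a hypothetical `SharedIndexRecord` where rendering and tokenisation are optional fields that the shared crawler may or may not fill in. The field names and the whitespace tokeniser are placeholders I made up, not a real schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SharedIndexRecord:
    """Hypothetical record in a shared index."""
    url: str
    raw_html: str                        # original source as fetched
    rendered_dom: Optional[str] = None   # only set if the crawler rendered the page
    tokens: Optional[List[str]] = None   # only set if the crawler tokenised it
    metrics: Dict[str, float] = field(default_factory=dict)


def text_for_indexing(record: SharedIndexRecord) -> str:
    """Prefer the rendered DOM when the crawler produced one."""
    if record.rendered_dom is not None:
        return record.rendered_dom
    return record.raw_html


def tokens_for(record: SharedIndexRecord) -> List[str]:
    """Use the crawler's tokens if present, else fall back to the engine's own.

    The fallback is a naive lowercase whitespace split, standing in for
    whatever tokeniser an end search engine would actually use.
    """
    if record.tokens is not None:
        return record.tokens
    return text_for_indexing(record).lower().split()


record = SharedIndexRecord(url="https://example.com/", raw_html="Hello World")
print(tokens_for(record))  # ['hello', 'world']
```

Every optional field left empty is work pushed onto each of the end search engines, and every field filled in is a decision the shared crawler made on their behalf.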

Also, there are issues with crawling Reddit, sites behind Cloudflare, etc., which others have gone into in more detail elsewhere on this comment page.