

by markpapadakis 3589 days ago

In the past we built and operated Greece’s largest search engine (Trinity), and we would crawl and refresh all Greek pages fairly regularly.

If memory serves, the frequency was computed for clusters of pages from the same site. It depended on how often those pages were updated (news sites’ front pages usually differed between successive fetches, whereas e.g. users’ homepages rarely changed) and on how resilient the site was to aggressive indexing: if requests failed or timed out, or a download took longer than we expected based on site-wide aggregated metrics, we’d adjust the frequency.
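A minimal sketch of that kind of adjustment, in Python. This is not Trinity's actual code: the names (`ClusterStats`, `next_interval`), the multipliers, and the interval bounds are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical bounds on how often a cluster may be recrawled.
MIN_INTERVAL_S = 15 * 60            # never more often than every 15 minutes
MAX_INTERVAL_S = 30 * 24 * 3600     # never less often than every 30 days


@dataclass
class ClusterStats:
    interval_s: float          # current recrawl interval for the page cluster
    expected_latency_s: float  # site-wide aggregate download time


def next_interval(stats: ClusterStats, changed: bool, ok: bool,
                  latency_s: float) -> float:
    """Return the next recrawl interval after one fetch of the cluster."""
    interval = stats.interval_s
    if not ok or latency_s > 2 * stats.expected_latency_s:
        # Failure, timeout, or a much slower download than the site-wide
        # aggregate: the site looks stressed, so back off.
        interval *= 2.0
    elif changed:
        # Content differed from the previous fetch: revisit sooner.
        interval *= 0.5
    else:
        # Nothing changed: revisit less often.
        interval *= 1.5
    # Clamp to the assumed bounds.
    return max(MIN_INTERVAL_S, min(MAX_INTERVAL_S, interval))
```

With these assumed factors, a cluster on an hourly interval moves to 30 minutes after a changed fetch, to 90 minutes after an unchanged one, and to two hours after a failure or slow download.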

Each crawler drained multiple queues, and URLs from the same site always ended up on the same queue (via consistent hashing on the hostname’s hash), so a single crawler process was responsible for throttling requests and respecting robots.txt rules for any given site, with no need for cross-crawler state synchronisation.
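The queue assignment could look roughly like the following Python sketch of a consistent-hash ring keyed on the hostname. The class name `QueueRing`, the use of SHA-1, and the number of virtual nodes are assumptions for illustration, not details of the original system.

```python
import bisect
import hashlib
from urllib.parse import urlsplit


def _hash(key: str) -> int:
    """Map a string to a 64-bit point on the ring."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class QueueRing:
    """Consistent-hash ring that maps hostnames to queue names."""

    def __init__(self, queue_names, vnodes=64):
        if not queue_names:
            raise ValueError("at least one queue is required")
        # Each queue gets several virtual points to even out the load.
        self._ring = sorted(
            (_hash(f"{name}#{i}"), name)
            for name in queue_names
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def queue_for(self, url: str) -> str:
        """Return the queue for a URL; only the hostname is hashed."""
        host = (urlsplit(url).hostname or "").lower()
        # First ring point clockwise from the host's hash, wrapping around.
        idx = bisect.bisect_right(self._points, _hash(host)) % len(self._ring)
        return self._ring[idx][1]
```

Because only the hostname is hashed, every URL of a site lands on the same queue, so whichever process drains that queue can keep the site's throttling and robots.txt state locally. Adding a queue moves only the hosts whose nearest ring point now belongs to the new queue.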

In practice this worked quite well. Also, this was before Google and its PageRank and social networks (we’d probably have also considered page popularity, based on PageRank-like metrics and social ‘signals’, in the frequency computation, among other variables).

2 comments

greglindahl 3589 days ago

In the current web, sites like Amazon are so large that you'll need many crawlers. On the plus side, it appears that almost all large sites don't have rate limits.


stummjr 3589 days ago

Crawl-delay is not part of the standard robots.txt protocol, and according to Wikipedia, some bots interpret the value differently. That may be why many websites don't even bother defining rate limits in robots.txt.
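For illustration, Python's standard `urllib.robotparser` exposes the directive when a site does set it and returns `None` when it does not, so a crawler needs its own fallback. The robots.txt content, the `examplebot` user agent, and the default delay below are made up for this sketch.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; a real crawler would fetch it from the site.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

DEFAULT_DELAY_S = 1.0  # assumed fallback when no Crawl-delay is declared


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse a robots.txt body that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def delay_for(rules: RobotFileParser, user_agent: str) -> float:
    """Use the declared Crawl-delay if present, else the assumed default."""
    declared = rules.crawl_delay(user_agent)
    return float(declared) if declared is not None else DEFAULT_DELAY_S


rules = load_rules(ROBOTS_TXT)
delay_s = delay_for(rules, "examplebot")
may_fetch = rules.can_fetch("examplebot", "http://example.com/private/x")
```

Note that this parser simply reports the number; whether a bot treats it as seconds between requests or in some other way is up to the bot.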


greglindahl 3589 days ago

I was referring to an actual rate limit, not crawl-delay. For example, YouTube is pretty strict about rate limits:

http://www.bing.com/search?q=%22We+have+been+receiving+a+lar...

I agree that crawl-delay is rare, and often it's set too long so that it's impossible to fully crawl a site -- as if the webmaster set it up 10 years ago and never updated it as their site got faster and bigger.
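A back-of-the-envelope sketch in Python of why a long crawl-delay can make a full crawl impractical, assuming a single crawler that fetches one page per delay period. The function names and the example numbers are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def days_for_full_crawl(page_count: int, crawl_delay_s: float) -> float:
    """Days needed to fetch every page once at one request per delay."""
    return page_count * crawl_delay_s / SECONDS_PER_DAY


def max_delay_for_refresh(page_count: int, refresh_days: float) -> float:
    """Largest delay (seconds) that still allows a full pass in refresh_days."""
    return refresh_days * SECONDS_PER_DAY / page_count
```

For example, a hypothetical site with 1,000,000 pages and a 30-second crawl-delay would take about 347 days for one full pass; refreshing it every 30 days would need a delay of roughly 2.6 seconds or less.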


atmosx 3589 days ago

Hi Mark, out of curiosity, which search engine is that?


markpapadakis 3589 days ago

It was called Trinity -- it was initially developed for Pathfinder.gr, and soon thereafter was the search provider for in.gr, and was also accessible at trinity.gr for some time.
