by tehwhale 2523 days ago
While this looks good, I don't think it's feasible for a web crawler in most cases. A crawler wants to crawl a ton of URLs, and it would have to make a request to your service for each and every one of them.

What's the plan here? Check for a sitemap.xml (which generally only contains crawlable URLs anyway) or crawl the index and look for all links and send a request to your service for every URL before crawling it?

I personally think it would be better suited as a library where you can pass it a robots.txt and it'll let you know if you can crawl a URL based on that.
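To make the library idea concrete, here is a minimal sketch using Python's standard urllib.robotparser. The helper name can_crawl and the sample rules in ROBOTS are made up for illustration; only RobotFileParser, parse and can_fetch come from the standard library.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the given robots.txt text.

    Hypothetical helper: the caller downloads robots.txt once and passes its
    contents in, so no network request is made here.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Sample rules, invented for this example.
ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    print(can_crawl(ROBOTS, "MyUserAgentBot", "https://example.com/test/user/1"))
    print(can_crawl(ROBOTS, "MyUserAgentBot", "https://example.com/private/page"))
```

With something like this the crawler fetches robots.txt once per site and answers every later check locally.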

1 comment

I implemented this service with the idea of making one network request for each new URL that needs to be crawled. Internally, the service caches all requests by base domain and user agent, so responses are very fast if the domain was checked previously.

For example, if you want to check the URL https://example.com/test/user/1 with the user agent MyUserAgentBot, the first request can be slow (~730ms), but subsequent requests with different paths and the same base URL, port and protocol will use the cached version (just ~190ms). Note that this version is in alpha and many things can be optimized. There is a trade-off to weigh between managing robots.txt files yourself in each project and paying the latency of a network request per check.
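To illustrate the caching idea (this is not the service's actual code, just a sketch under my own assumptions), the parsed rules can be keyed by protocol, host, port and user agent, so only the first check for a site pays for fetching robots.txt. The class RobotsCache, the fetch callable and fake_fetch are hypothetical names; the parsing relies on Python's standard urllib.robotparser.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsCache:
    """Hypothetical cache of parsed robots.txt rules per (base URL, user agent)."""

    def __init__(self, fetch):
        # `fetch` is an assumed callable: base URL -> robots.txt text.
        self._fetch = fetch
        self._parsers = {}
        self.fetch_count = 0  # number of cache misses, for demonstration only

    def allowed(self, user_agent: str, url: str) -> bool:
        parts = urlsplit(url)
        # scheme + netloc covers protocol, host and port.
        base = f"{parts.scheme}://{parts.netloc}"
        key = (base, user_agent)
        parser = self._parsers.get(key)
        if parser is None:
            # Cache miss: fetch and parse robots.txt once for this key.
            self.fetch_count += 1
            parser = RobotFileParser()
            parser.parse(self._fetch(base).splitlines())
            self._parsers[key] = parser
        return parser.can_fetch(user_agent, url)


def fake_fetch(base_url: str) -> str:
    # Stand-in for a real HTTP GET of base_url + "/robots.txt".
    return "User-agent: *\nDisallow: /admin/\n"


cache = RobotsCache(fake_fetch)

if __name__ == "__main__":
    print(cache.allowed("MyUserAgentBot", "https://example.com/test/user/1"))
    print(cache.allowed("MyUserAgentBot", "https://example.com/test/user/2"))
    print(cache.fetch_count)
```

A real version would also need some expiry policy, since robots.txt can change; that is left out here.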

Anyway, anyone can compile the parser module and build a library that checks robots.txt rules on its own ;-)

PS: thanks for the feedback