by tehwhale
2523 days ago
While this looks good, I don't think it's feasible for a web crawler in most cases. Crawlers want to crawl a ton of URLs, and they would have to make a request to your service for each and every one. What's the plan here? Check for a sitemap.xml (which generally only contains crawlable URLs anyway), or crawl the index, collect all the links, and send a request to your service for every URL before crawling it? I personally think this would be better suited as a library: you pass it a robots.txt and it tells you whether you can crawl a URL based on that.
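To make the library idea concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not the service's parser module; the robots.txt content and the `build_parser` helper are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: MyUserAgentBot
Disallow: /test/admin/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that the caller already has in hand (no network)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The group for MyUserAgentBot only disallows /test/admin/.
allowed = parser.can_fetch("MyUserAgentBot", "https://example.com/test/user/1")
blocked = parser.can_fetch("MyUserAgentBot", "https://example.com/test/admin/settings")

# Any other agent falls back to the "*" group.
other = parser.can_fetch("OtherBot", "https://example.com/private/page")
```

With this shape the crawler fetches robots.txt once itself and every later check is a local call, which is the point being made about avoiding a request per URL.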
For example, if you want to check the URL https://example.com/test/user/1 with the user agent MyUserAgentBot, the first request can be slow (~730ms), but subsequent requests with different paths and the same base URL, port and protocol will use the cached version (just ~190ms). Note that this version is in alpha and many things can still be optimized. It's a trade-off between managing these robots.txt files yourself in each project and paying the time cost of network requests.
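The caching behaviour described here (one robots.txt per protocol, host and port, reused across paths) could be sketched roughly like this. This is only an illustration, not the service's actual code: `RobotsCache`, `origin_of` and the injected `fetch` callable are hypothetical names, and a real version would need an HTTP fetcher and some expiry policy.

```python
from typing import Callable, Dict, List, Tuple
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

Origin = Tuple[str, str, int]  # (protocol, host, port)
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin_of(url: str) -> Origin:
    """Reduce a URL to the key that decides which robots.txt applies."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    port = parts.port or DEFAULT_PORTS.get(scheme, 0)
    return (scheme, parts.hostname or "", port)


class RobotsCache:
    """Keeps one parsed robots.txt per origin; no expiry in this sketch."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch  # assumed to return the robots.txt body as text
        self._parsers: Dict[Origin, RobotFileParser] = {}

    def allowed(self, user_agent: str, url: str) -> bool:
        key = origin_of(url)
        parser = self._parsers.get(key)
        if parser is None:
            # Slow path: first URL seen for this origin.
            scheme, host, port = key
            parser = RobotFileParser()
            parser.parse(self._fetch(f"{scheme}://{host}:{port}/robots.txt").splitlines())
            self._parsers[key] = parser
        # Fast path: same origin, different path, no new fetch.
        return parser.can_fetch(user_agent, url)


# Demo with a fake fetcher so nothing touches the network.
calls: List[str] = []


def fake_fetch(robots_url: str) -> str:
    calls.append(robots_url)
    return "User-agent: *\nDisallow: /test/admin/\n"


cache = RobotsCache(fake_fetch)
first = cache.allowed("MyUserAgentBot", "https://example.com/test/user/1")
second = cache.allowed("MyUserAgentBot", "https://example.com/test/admin/1")
third = cache.allowed("MyUserAgentBot", "http://example.com/test/user/1")
```

In the demo the second check reuses the parser fetched for the first one, while the third uses a different protocol and so triggers a separate fetch.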
Anyway, anyone can compile the parser module and build a library that checks robots.txt rules on its own ;-)
PS: thanks for the feedback