| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by fooock 2522 days ago

I implemented this service thinking in make a network request for each new URL that needs to be crawled. Internally the service caches all requests by the base domain and user agent. The responses are very fast if these domain was previously checked.

For example if you want to check the url https://example.com/test/user/1 with a user agent MyUserAgentBot, the first request can be slow (~730ms) but subsequent requests with different paths but same base url, port and protocol, will use the cached version (just ~190ms). Note that this version is in alpha and many things can be optimized. The balance between managing these files in different projects or the time between network requests must be sought.

Anyway, any person can compile the parser module and create a library to check robots.txt rules by itself ;-)

PS: thanks for the feedback