by tommccabe 4230 days ago
Looks like I'm at one of the retailers you crawl. Recently, our site was getting hit by a web crawler that was following links incorrectly. I blacklisted several IP addresses from accessing the site, and now I wonder if it was this!

Does your crawler obey robots.txt rules?

Probably they don't, because so much of the web has robots.txt files like:

User-Agent: established_company
Allow: /some-stuff

User-Agent: *
Disallow: /
# keeps out filthy peasants

And you're stuck either following them, and going without data that would be offered up for free if you were someone else, or being a bad person and ignoring them. You don't really see the services that do follow the rules.
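To make the effect of a file like that concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The `example.com` URLs and the `new_crawler` agent name are made up for illustration; only the rules themselves come from the example above.

```python
from urllib.robotparser import RobotFileParser

# The rules from the example above (the comment line is omitted).
ROBOTS_TXT = """\
User-Agent: established_company
Allow: /some-stuff

User-Agent: *
Disallow: /
"""


def can_fetch(user_agent: str, url: str) -> bool:
    """Return True if the rules above let `user_agent` fetch `url`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical URL and crawler name, used only for this illustration.
url = "https://example.com/some-stuff/page"
print(can_fetch("established_company", url))  # True
print(can_fetch("new_crawler", url))          # False
```

Any crawler that isn't named falls through to the `*` group and is shut out of the whole site, which is the situation described above.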

Also, here's a good paper on how much being preferred in robots.txt helps: it makes you a better product, which makes you more preferred...

https://etda.libraries.psu.edu/paper/9230/4516

Tom, that most likely wasn't us.

We don't spider retailer websites. That means we don't follow links or go hardcore on building a database of products.

We hit your website:

* if someone has asked us for information about a product URL

* when we place an order

* weekly for regression tests

Ping us on contact@ and we're more than happy to jump on a call and describe exactly what we're doing. Most of the time we're completely unnoticeable except for the fact that you're getting more orders.

We know for sure nobody is spidering through us.

Thanks for the confirmation. How do you get the product URL in the first place?
That's the app developer's responsibility.
Is there any caching that goes on here? I would presume not, as the vendor's prices / details could change at any point.
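The thread doesn't say whether the service caches anything. Purely to illustrate the trade-off raised in the question, here is a sketch of a short time-to-live cache, where a lookup is reused only for a limited window so stale prices expire quickly. `TTLCache`, its `fetch` callback, and the 60-second window are all hypothetical and not taken from the service being discussed.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Cache lookups per URL, refetching once an entry is older than the TTL."""

    def __init__(
        self,
        fetch: Callable[[str], Any],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch        # hypothetical function that hits the retailer
        self._ttl = ttl_seconds    # how long a cached result counts as fresh
        self._clock = clock        # injectable so the cache can be tested
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, url: str) -> Any:
        now = self._clock()
        entry = self._entries.get(url)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]        # still fresh: no request to the retailer
        value = self._fetch(url)   # missing or stale: fetch again
        self._entries[url] = (now, value)
        return value


def fetch_product(url: str) -> dict:
    """Stand-in for a real product lookup; returns placeholder data."""
    return {"url": url, "price": None}


cache = TTLCache(fetch_product, ttl_seconds=60)
first = cache.get("https://example.com/product/1")
second = cache.get("https://example.com/product/1")  # served from the cache
```

A shorter TTL means fresher prices but more requests to the retailer; a TTL of zero is equivalent to not caching at all.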