Hacker News
by al_james 4864 days ago
If it's being used for competitive price analysis, I wonder if any retail sites will simply block their crawler? I am assuming that they are (correctly) announcing their crawler by its user agent, so it could be blocked via robots.txt.
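For anyone wondering what that kind of block looks like in practice, here is a rough sketch using Python's standard `urllib.robotparser`. The crawler name `PriceBot`, the host `shop.example`, and the rules themselves are all made up for illustration; nothing in this thread says what user agent the service actually sends. Keep in mind that robots.txt only stops crawlers that choose to honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a retailer might serve.
# "PriceBot" is an invented crawler name, not the real service's user agent.
ROBOTS_TXT = """\
User-agent: PriceBot
Disallow: /

User-agent: *
Disallow: /checkout/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The named crawler is shut out of the whole site...
blocked = is_allowed(ROBOTS_TXT, "PriceBot", "https://shop.example/products/123")
# ...while any other well-behaved bot can still read product pages.
allowed = is_allowed(ROBOTS_TXT, "OtherBot", "https://shop.example/products/123")
```

A crawler that calls `can_fetch` before every request and respects the answer is the "announcing correctly" case; one that spoofs its user agent would slip past this entirely.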
3 comments

Some percentage of online retailers will block their crawler. It's a situation where party A wants party B's data and there is really no reason party B would want party A to have it. Even small and medium-sized online retailers think to block pricing crawlers. Yahoo Store devs are irked that you still can't upload a custom robots.txt file to the web root. Forbidding unwanted crawlers from poking around is frequently the reason (applies to stores using the store editor).

Maybe in exchange for the privilege of obtaining pricing data, this service could offer to automatically clean up their product data. Win-win for everyone.

Yeah, exactly what I was thinking. Once Amazon and big e-tailers figure out they're being crawled for profits or to the detriment of their sales, you can bet they will restrict access and charge a fee. Seems like a good idea, and I'm sure many of us have thought of the same idea, but it's easy for any of the big guys to block their crawler whenever they wish. Crowdsourcing works to a certain extent but that's what deal sites are for.

I remember a real estate startup crawling listing prices from many real estate sites for market analysis. Needless to say, that startup was shut down quicker than you can say "doh!"

Not to mention, many ecommerce sites explicitly forbid this sort of thing. I'd be interested to know how they got around it.
We currently get the pricing data via RSS feeds, crawling, data dumps and, in some cases, crowdsourcing. In the long run, we also hope to establish merchant relationships and get the data directly.

To the original question on crawling (I had replied to a similar question previously on HN): "Some great advice here on crawling at scale, which has inspired our crawlers a lot: http://news.ycombinator.com/item?id=4367933 Basically it boils down to three things: 1. If the site is slow, crawl slooowly. 2. If you see non-200 HTTP status codes, stop! 3. Obey robots.txt and speed restrictions."
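
The three rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the crawler described in this thread: `polite_crawl`, the `ExampleBot` user agent, and the `min_delay` and `slowdown` parameters are invented names, and the `fetch` callable is injected so the sketch does not assume any particular HTTP library. "Crawl slowly if the site is slow" is interpreted here as waiting a multiple of the last response time, which is one reasonable reading among several.

```python
import time
from urllib.robotparser import RobotFileParser


def polite_crawl(urls, fetch, robots, user_agent="ExampleBot",
                 min_delay=1.0, slowdown=2.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Fetch urls in order while following the three rules.

    fetch(url) must return (status_code, body); robots is a RobotFileParser
    that has already parsed the site's robots.txt. All names and defaults
    are hypothetical choices for this sketch.
    """
    # Rule 3: respect an explicit Crawl-delay if the site declares one.
    delay = max(min_delay, float(robots.crawl_delay(user_agent) or 0))
    pages = []
    for url in urls:
        # Rule 3: skip anything robots.txt disallows for this user agent.
        if not robots.can_fetch(user_agent, url):
            continue
        started = clock()
        status, body = fetch(url)
        elapsed = clock() - started
        # Rule 2: any non-200 status ends the crawl immediately.
        if status != 200:
            break
        pages.append((url, body))
        # Rule 1: a slow response means a longer pause before the next request.
        sleep(max(delay, slowdown * elapsed))
    return pages
```

Passing `sleep` and `clock` as parameters is purely a convenience for testing the loop without real waits; in production you would leave the defaults and supply a `fetch` built on whatever HTTP client you already use.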