Hacker News
by palakchokshi 4470 days ago
I know we can crawl them; there are no technical obstacles. The issue is that I want to respect their TOS. There are multiple ways to circumvent their anti-crawling code, but that means any new search engine will have its roots in "shady" crawling tactics. IMHO crawling should be allowed if the purpose of the crawler is to show results that drive traffic back to the sites, rather than to mash up the content and present it as "original content" on the crawler's own site. However, Yelp's TOS, for example, does not allow even crawlers that essentially drive traffic back to Yelp.

> The issue is I want to respect their TOS.

Permitting certain search engines to crawl but not others is anticompetitive and violates the principle of an open web.

> any new search engine will have its roots in "shady" tactics of crawling.

Who cares? If you are successful enough, then you'll negotiate with them later. Don't worry about it now.

Crawl away!

http://yelp.com/robots.txt

It seems you can contact Yelp and tell them how you plan to use their data and maybe they'll let you crawl their site.
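If you do want to honor robots.txt while you wait to hear back, here is a minimal sketch of checking it with Python's standard `urllib.robotparser`. The sample rules, the `is_allowed` helper, and the `MyNewCrawler` user agent are all made up for illustration; this is not Yelp's actual file. Also keep in mind that robots.txt and the TOS are separate things, so passing this check says nothing about the TOS.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only.
# This is NOT the real file from any site mentioned in the thread.
SAMPLE_ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://example.com/biz/some-page"
    for agent in ("Googlebot", "MyNewCrawler"):
        print(agent, is_allowed(SAMPLE_ROBOTS_TXT, agent, page))
```

Against a live site you would call `set_url(...)` and `read()` on the parser instead of `parse()`. With rules like the sample above, a whitelisted crawler gets through and a new, unlisted one does not, which is exactly the situation being complained about here.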

I really want to explore any and all alternative options before I decide to crawl away :)

Are you a hacker, or aren't you? Dumb rules exist to be broken.
Agreed, but there is no harm in tapping the vast knowledge of the HN community to exhaust the alternatives before tightening my hacker cap and plunging in head first.
Yelp is going after startups that crawl their site [1]. I don't want to make Yelp the poster site for this, because other big sites do this too. [1] http://www.courthousenews.com/2012/01/27/43403.htm
Agreed. 99% of the time these big companies don't care. Also, use proxies.

If they come knocking on your door to take down the content, then worry about it.