HN Mirror



	by int64 4469 days ago
	If the service is publicly exposed on the web, it can be crawled regardless of whatever guards the service provider has put in place. Browser emulation would be a good start.


palakchokshi 4469 days ago

I know we can crawl them; there are no technical obstacles. The issue is that I want to respect their TOS. There are multiple ways to circumvent their anti-crawling code, but that means any new search engine will have its roots in "shady" crawling tactics. IMHO, crawling should be allowed if the purpose of the crawler is to show results that drive traffic back to the sites, not to mash up the content and present it as "original content" on the crawling site. However, Yelp's TOS, for example, do not allow even crawlers of this kind, which essentially drive traffic back to Yelp.


argumentum 4469 days ago

> The issue is I want to respect their TOS.

Permitting certain search engines to crawl but not others is anticompetitive and violates the principle of an open web.

> any new search engine will have its roots in "shady" tactics of crawling.

Who cares? If you are successful enough, then you'll negotiate with them later. Don't worry about it now.

Crawl away!


palakchokshi 4469 days ago

http://yelp.com/robots.txt

It seems you can contact Yelp and tell them how you plan to use their data and maybe they'll let you crawl their site.

I really want to explore any and all alternative options before I decide to crawl away :)
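For anyone who wants to check robots.txt rules programmatically before crawling, here is a minimal sketch using Python's standard `urllib.robotparser`. The sample rules, the `ExampleBot` and `OtherBot` agent names, and the example.com URLs are made up for illustration; they are not Yelp's actual rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: ExampleBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleBot", "OtherBot"):
        for path in ("/biz/some-place", "/private/data"):
            url = "https://example.com" + path
            print(agent, path, is_allowed(SAMPLE_ROBOTS_TXT, agent, url))
```

Against a live site you would call `set_url()` and `read()` on the parser instead of `parse()`. Note that robots.txt only expresses the site's crawling preferences; it says nothing about what the TOS permits.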


argumentum 4469 days ago

Are you a hacker, or aren't you? Dumb rules exist to be broken.


palakchokshi 4469 days ago

Agreed, but there's no harm in tapping the vast knowledge of the HN community to exhaust the alternatives before tightening my hacker cap and plunging in head first.


palakchokshi 4469 days ago

Yelp is going after startups that crawl their site [1]. I don't want to make Yelp the poster site for this, because other big sites do this too.

[1] http://www.courthousenews.com/2012/01/27/43403.htm


AznHisoka 4469 days ago

Agreed. 99% of the time these big companies don't care. Also, use proxies.

If they come knocking on your door to take down the content, then worry about it.
