by laumars 3057 days ago

> my point being that I find it that they would just crawl everything they recorded instead of just crawling pages which are linked publicly or which are targeted in ad-campaigns

There's no way to know which pages are linked publicly without crawling every page for links. So you're right back at square one.

Ultimately, if it's on an Internet-facing web server and not hidden behind an IP whitelist or a secure login, then you have to assume it is public. All you are arguing about is different degrees of "public", which somewhat misses the real issue of website security.
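To make the IP whitelist idea concrete, here is a rough sketch in Python. The networks listed and the `is_allowed` helper are hypothetical placeholders of mine, not anything a particular server or framework ships with; in practice this check usually lives in the web server or firewall config rather than in application code.

```python
import ipaddress

# Hypothetical allowlist: documentation-reserved ranges used purely as placeholders.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_allowed(client_ip: str) -> bool:
    """Return True only if client_ip parses and falls inside an allowed network."""
    try:
        addr = ipaddress.ip_address(client_ip)
    except ValueError:
        # Anything that is not a valid IP address is rejected outright.
        return False
    # Membership is False when address and network IP versions differ.
    return any(addr in net for net in ALLOWED_NETWORKS)
```

Anything that fails a check like this gets refused before the request reaches your content, which is the difference between "unlinked" and actually private. Bear in mind that behind a proxy the address you see may be the proxy's, so which address you test is an assumption you would need to verify for your own setup.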

Some crawlers do deliberately hit random URLs to check how you're handling 404s. Other crawlers are entirely dishonest and will try to find content that wasn't intended to be made public. How are you going to handle them if you're stumped by the Facebook crawlers that you invited onto your site?

> ...combined with the fact that they don't warn you about it

It's pretty obvious behavior in my opinion, but maybe they could have been more explicit. However, going back to my previous point, no other crawler advertises what it's going to crawl beforehand. So where do you draw the line? Ranting that Google indexed your site? What about visitors buying stuff on your ecommerce package without prior communication requesting access to the site?

You wouldn't ask customers in a bricks-and-mortar store to state their intentions the moment they walked through the shop door, so why should every HTTP user agent have to do the same? While web security can be both complex and maddening, the responsibility for hardening the site is still yours, not Facebook's.