by VMG 3057 days ago
Those people should suffer the consequences.

I'm not a fan of Facebook in the slightest, but they are crawling websites they were essentially invited to.


>websites they were essentially invited to

Using an analytics pixel is _not_ an invitation to crawl a website.

No, merely posting it on a public server was the invitation.
It is not. Leaving your door unlocked is not inviting everybody in to take your stuff. You might make it easier for them to break in, but it is still a break-in.

Making something available to the public is not the same as going to Google Webmaster Tools and telling them to index your page.

You can argue it's abuse or illegal or fraud or whatever you want, but here's the thing: how are you going to stop them? Sure, maybe you stop Facebook with a lawsuit... but everyone else is still doing it, even people outside of your legal jurisdiction. They're still going to do it, so it's up to you to stop them with your design. If someone breaks into your website and destroys a user's data or steals their credit card, that user is not going to want to hear "but what they did was against our ToS!"

This isn't your house where there are police patrolling and ready to respond at a moment's notice when they're called. This is the Internet, accessible by almost literally everyone on the planet, and they don't give a shit about your policy. That's why best practices and application security were invented. So use them.

"Hello, I am an HTTP client, can I have /some/super/secret/page?" "200 OK, here it is"

That's your server complying with the request. Whether by intent or by oversight, doesn't matter: the client comes and asks, and your server can refuse. If it complies, well, you told it to. Whether you have merely exposed the page to the public or also shouted its URL from the rooftops, that's completely irrelevant. If it's not supposed to be public, don't make it public.

"Hello, I am an HTTP client, can I have /some/super/secret/page?" "Oh, but you are ^User-agent$=.acebook ? Nope, 403 Forbidden, no data for you." (Or, more generally, "And who are you? 401 Authorize!" - or any other sort of mandatory access control)
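
To make the exchange above concrete, here is a rough sketch of that decision as a plain function. Everything in it is an assumption for illustration: the `decide` name, the `facebook` pattern, the protected prefix, and the example token are made up, and a real server would do this in its own configuration or middleware.

```python
import re

# Hypothetical policy values, chosen only for illustration.
BLOCKED_AGENTS = re.compile(r"facebook", re.IGNORECASE)
PROTECTED_PREFIX = "/some/super/secret/"
VALID_TOKENS = {"Bearer example-token"}


def decide(path, headers):
    """Return the HTTP status code the server would answer with."""
    if not path.startswith(PROTECTED_PREFIX):
        # Public content: served to anyone who asks.
        return 200
    if BLOCKED_AGENTS.search(headers.get("User-Agent", "")):
        # Refuse based on the self-reported client identity.
        return 403
    if headers.get("Authorization") not in VALID_TOKENS:
        # No valid credentials: ask the client to authenticate.
        return 401
    return 200


if __name__ == "__main__":
    print(decide("/some/super/secret/page", {"User-Agent": "facebookexternalhit/1.1"}))
    print(decide("/some/super/secret/page", {}))
```

Note that the User-Agent header is set by the client and is trivially spoofed, so the 403 branch only stops crawlers that identify themselves honestly. The credential check is the part that actually keeps a page private.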

Someone viewing a webpage you put online is not at all like someone stealing something you own.
That's like saying "having a public website is an invitation to DoS attacks".

There are conventions and reasonable expectations. Until now I did not expect that a tracking pixel would be the basis for crawling. So far most crawlers tend to crawl what's publicly linked, not what's potentially publicly reachable if one knows every URL there is.

Posting a file to a public web server is an implicit invitation for clients (human or automated) to download that file. That's why "secret URLs" are universally considered to provide very little security.

There are common conventions (not always followed) around robots.txt and what files to crawl, but I'm not aware of any rules or conventions or standards around URL discovery. Plenty of crawlers attempt to crawl every registered domain name, for example.
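
For reference, this is roughly what honoring the robots.txt convention looks like from the crawler's side, using Python's standard `urllib.robotparser`. The robots.txt content, the `example.com` URLs, and the `ExampleBot` agent are hypothetical; `facebookexternalhit` is used only as an illustrative agent token.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: facebookexternalhit
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(text):
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A polite crawler asks before fetching each URL.
checks = [
    ("facebookexternalhit/1.1", "https://example.com/index.html"),
    ("ExampleBot/1.0", "https://example.com/private/report"),
    ("ExampleBot/1.0", "https://example.com/index.html"),
]
for agent, url in checks:
    print(agent, url, parser.can_fetch(agent, url))
```

The check is voluntary: nothing here stops a crawler that never reads robots.txt, and the file says nothing about how a crawler found the URL in the first place.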

"DoS attack" is sort of a loaded term since it implies malice. Clearly running a web server doesn't mean you invite malicious attacks (though perhaps you should expect them). Some people consider Googlebot to be a DoS attack since it can easily bring poorly designed sites to their knees.

I watch my site's Google index and I can tell you 100% I never gave Google explicit permission to crawl 90% of the pages that show up there.