| HN Mirror


by sams99 5129 days ago

How about reading http://meta.stackoverflow.com/questions/132675/validating-th... before swearing at me.

No, we are not going to ban all the links from GitHub on our site because shitty WEB CRAWLERS forced GitHub to use a white-list based approach. This is not WEB CRAWLING. It is link validating. We are not crawling in the sense of building a huge tree of links. We are testing that the external links on our sites work. If we are not allowed to test them, why are our users allowed to click them? Are we not committing an even greater crime by allowing these links on our website?

"The Robot Exclusion Standard, also known as the Robots Exclusion Protocol or robots.txt protocol, is a convention to prevent cooperating web crawlers and other web robots from accessing all or part of a website which is otherwise publicly viewable."

The convention is a best-effort thing. We tried to respect it, but doing so went AGAINST what the authors of the robots.txt file at GitHub intended, AND the spec is advisory, not an IETF RFC. If it were an RFC, some smart people would review it and turn it into something sane and usable that deals with this exact use case.

And you know what, an RFC would NEVER pass for robots.txt as it is now, because the white-listing potential is anti-Internet. Why should Google and Bing be the only parties who are allowed to discover content on the Internet? User agent restrictions are completely evil, wrong and backwards.
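To make the white-listing point concrete, here is a minimal sketch of how a white-list style robots.txt treats a user agent that is not on the list. It uses Python's standard `urllib.robotparser`. The file contents, the `example.com` URL and the `ExampleLinkChecker` agent name are made up for illustration; this is not GitHub's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical white-list style robots.txt: a few named crawlers are
# allowed, every other user agent is denied. NOT GitHub's real file.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: bingbot
Allow: /

User-agent: *
Disallow: /
"""

# Illustrative URL, assumed for this sketch only.
URL = "https://example.com/some/repo"


def allowed(user_agent: str, url: str) -> bool:
    """Return True if the rules above let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# The two listed crawlers get through; the unlisted link checker
# falls into the catch-all "*" group and is denied.
for agent in ("Googlebot", "bingbot", "ExampleLinkChecker"):
    print(agent, allowed(agent, URL))
```

Under these assumed rules, a cooperating link validator that honors the file cannot test a single link, while the listed search engines can fetch everything.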

Sorry to shatter your imaginary delusion of what you think the Internet is.


fnulp 5129 days ago

"before swearing at me."

Followed by "shitty" and "web crawlers" in all caps ^^ Someone ate a clown for breakfast, I see.

"No, we are not going to ban all the links from GitHub"

What? Slow down there -- why would you care about invalid links? Did you just say that you can't possibly allow users to post links, as long as you don't know they work for automatic crawlers, not just for human visitors? And someone else chimed in saying you give arguments? Heh.

Well, you make an attempt at one, with "we are not crawling in the sense of", and then refute it with the bit you quoted: "web crawlers and other web robots". It's not called webscraper.txt; it's robots.txt, period.

So how then would a website determine a rogue user agent? You dress up like the slimy guys, you get the banhammer -- what do you expect? If you care so much about Facebook and Twitter "content" that it is worth it for you to be indistinguishable from attackers, then just cope with it. But don't pout at me, just eat up what you ordered.

And what delusions about the internet? You just beat around the bush and then finish with that strawman? And what is an "imaginary delusion", by the way? The one you imagine I have? Now that's a Freudian slip if I ever saw one ^^

"Completely evil, wrong and backwards"... so... You're entitled to know the validity of links posted on your site, but website owners aren't allowed to care about their resources and who they offer them to? Who's deluded?

scott_s 5129 days ago

"Slow down there -- why would you care about invalid links?"

Because Stack Overflow is a site whose purpose is to answer questions. People may provide links when asking or answering a question, and those links may be important in understanding either the question or the answer. So invalid links degrade the value of the site.

What they're doing is fundamentally different from web crawling. Web crawlers are about discovering content. That means starting at a root and crawling out to see what you can find. One URL can spawn many more URLs to look at. Stack Overflow, by contrast, starts with a known URL and checks whether it can visit that URL. It has one URL, and only visits one URL.
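The difference in request pattern can be sketched with a toy example. Everything here is an assumption for illustration: the `PAGES` dictionary stands in for the web, and `fetch`, `validate_link` and `crawl` are hypothetical helpers, not Stack Overflow's actual code.

```python
from collections import deque

# Toy in-memory "web": URL -> outgoing links. Purely illustrative data.
PAGES = {
    "http://a.example/": ["http://a.example/1", "http://a.example/2"],
    "http://a.example/1": ["http://a.example/3"],
    "http://a.example/2": [],
    "http://a.example/3": [],
}


def fetch(url, log):
    """Pretend to request a URL; record it and return (ok, links)."""
    log.append(url)
    if url in PAGES:
        return True, PAGES[url]
    return False, []


def validate_link(url, log):
    """Link validation: one known URL, one request, links are ignored."""
    ok, _links = fetch(url, log)
    return ok


def crawl(root, log):
    """Crawling: follow every discovered link outward from the root."""
    seen = {root}
    queue = deque([root])
    while queue:
        url = queue.popleft()
        _ok, links = fetch(url, log)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen


validator_log, crawler_log = [], []
validate_link("http://a.example/", validator_log)
crawl("http://a.example/", crawler_log)
print(len(validator_log), len(crawler_log))
```

In this sketch the validator issues a single request for the URL it was given, while the crawler's request count grows with every link it discovers.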