No, we are not going to ban all the links from GitHub on our site cause shitty WEB CRAWLERS forced GitHub to use a white-list based approach. This is not WEB CRAWLING. It is link validating. We are not crawling in the sense of building a huge tree of links. We are testing that the external links on our sites work. If we are not allowed to test them, why are our users allowed to click them? Are we not committing an even greater crime by allowing these links on our web site?
"The Robot Exclusion Standard, also known as the Robots Exclusion Protocol or robots.txt protocol, is a convention to prevent cooperating web crawlers and other web robots from accessing all or part of a website which is otherwise publicly viewable."
The convention is a best-effort thing. We tried to respect it, but doing so went AGAINST what the authors of the robots.txt file at GitHub intended, AND the spec is advisory, not an IETF RFC. If it were an RFC, some smart people would review it and turn it into something sane and usable that deals with this exact use case.
And you know what, an RFC would NEVER pass for robots.txt as it is now, because the white-listing potential is anti-Internet. Why should Google and Bing be the only parties who are allowed to discover content on the Internet? User agent restrictions are completely evil, wrong and backwards.
Sorry to shatter your imaginary delusion of what you think the Internet is.
Followed by "shitty" and "web crawlers" in all caps ^^ Someone ate a clown for breakfast, I see.
"No, we are not going to ban all the links from GitHub"
What? Slow down there -- why would you care about invalid links? Did you just say that you can't possibly allow users to post links, as long as you don't know they work for automatic crawlers, not just for human visitors? And someone else chimed in saying you give arguments? Heh.
Well, you give an attempt at one, with "we are not crawling in the sense of", and then refute it with the bit you quoted: "web crawlers and other web robots". It's not called webscraper.txt, it's robots.txt, period.
So how then would a website determine a rogue user agent? You dress up like the slimy guys, you get the banhammer -- what do you expect? If you care so much about Facebook and Twitter "content" that it is worth it for you to be indistinguishable from attackers, then just cope with it. But don't pout at me, just eat up what you ordered.
And what delusions about the internet? You just beat around the bush and then finish with that strawman? And what is an "imaginary delusion", by the way? The one you imagine I have? Now that's a Freudian slip if I ever saw one ^^
"Completely evil, wrong and backwards"... so... You're entitled to know the validity of links posted on your site, but website owners aren't allowed to care about their resources and who they offer them to? Who's deluded?
Slow down there -- why would you care about invalid links?
Because Stack Overflow is a site whose purpose is to answer questions. People may provide links when asking or answering a question, and those links may be important in understanding either the question or the answer. So invalid links degrade the value of the site.
What they're doing is fundamentally different from web crawling. Web crawlers are about discovering content. That means starting at a root and crawling out to see what you can find. One URL can spawn many more URLs to look at. The link validator, by contrast, starts with a known URL and checks whether it can visit that URL. It has one URL, and only visits one URL.
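To make the difference concrete, here is a minimal sketch in Python. It is not Stack Overflow's actual validator; the function names and the `fetch` callable (which returns a status code plus the links found on the page) are assumptions made up for illustration. The validator touches only the URLs it was handed, while the crawler keeps following whatever it discovers.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical fetcher: given a URL, return (status_code, links_on_that_page).
Fetch = Callable[[str], Tuple[int, List[str]]]


def validate_links(urls: Iterable[str], fetch: Fetch) -> Dict[str, bool]:
    """Link validation: visit each known URL once and never follow its links."""
    results = {}
    for url in urls:
        status, _links = fetch(url)  # links on the page are ignored on purpose
        results[url] = 200 <= status < 400
    return results


def crawl(root: str, fetch: Fetch, limit: int = 100) -> List[str]:
    """Crawling: start at a root and keep following newly discovered links."""
    seen = {root}
    queue = deque([root])
    visited = []
    while queue and len(visited) < limit:
        url = queue.popleft()
        _status, links = fetch(url)
        visited.append(url)
        for link in links:
            if link not in seen:  # one URL can spawn many more URLs
                seen.add(link)
                queue.append(link)
    return visited
```

With the same `fetch`, `validate_links(["a", "c"], fetch)` makes exactly two requests, whereas `crawl("a", fetch)` makes as many as the link graph (or `limit`) allows.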
He explains how and why in the article, and gives arguments. You do none.
The problem with whitelist-only robots.txt files is that they favor monopolies, and startups are the ones getting the "fuck you". But maybe you don't care about that.
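For anyone who hasn't seen what a whitelist-only file does to an unknown bot, here is a rough sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `LinkValidator` user agent and the example.com URL are all made up for illustration; this is not GitHub's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical whitelist-style robots.txt: one named bot is allowed,
# every other user agent is shut out of the whole site.
WHITELIST_ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(WHITELIST_ROBOTS_TXT.splitlines())


def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if the parsed robots.txt lets this user agent fetch the URL."""
    return parser.can_fetch(user_agent, url)


url = "https://example.com/some/repo"
print(is_allowed("Googlebot", url))          # the whitelisted crawler
print(is_allowed("LinkValidator/1.0", url))  # any newcomer falls under "*"
```

Any agent not named in the file lands in the `*` group, so a new search engine and a one-URL link checker are treated exactly the same.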
What? I simply meant that if some brainiacs think robots.txt can just be disregarded, it's time to make it a minimum requirement of every self-respecting webmaster to make a tarpit (disallowed in robots.txt) and ban any and all bots going there. You would exactly NOT want a human visitor to post, or ever see, such a link. So yeah, it wouldn't even apply to this github thing, but don't tell that other guy about it.
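The tarpit idea can be sketched roughly like this, in Python. Everything here is an assumption for illustration: the `/trap/` path (which would be listed under `Disallow` in robots.txt and never linked anywhere a human would see it), the class name, and the in-memory ban set.

```python
TRAP_PREFIX = "/trap/"  # hypothetical path, disallowed in robots.txt


class TarpitBanList:
    """Ban any client that requests a path robots.txt told bots to stay out of."""

    def __init__(self, trap_prefix: str = TRAP_PREFIX) -> None:
        self.trap_prefix = trap_prefix
        self.banned = set()  # in-memory only; a real site would persist this

    def handle(self, client_ip: str, path: str) -> int:
        """Return the HTTP status code to send for this request."""
        if client_ip in self.banned:
            return 403
        if path.startswith(self.trap_prefix):
            # Only a bot ignoring robots.txt should ever end up here.
            self.banned.add(client_ip)
            return 403
        return 200
```

A client that only ever requests URLs posted by humans never hits the trap path, so it is never banned by this scheme.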
These are supposedly good guys. So my reaction was "You gotta be fucking kidding?! You didn't just say that it's inconvenient how some sites use robots.txt, so you just throw it out altogether for your precious little bot and epically important link checking quest. No wait, you did. Oh well then, BYE."
Oh well. I guess this is hack news, not hacker news, my bad :P
The sort of tarpit you're talking about wouldn't even affect this link validator. You really think Stack Exchange should have given up on validating links because GitHub's robots.txt has:
Would link validation be OK if I manually went through and clicked every single link by hand, and used a pen-and-paper tally of which ones worked and which ones didn't?