| HN Mirror

	by fnulp 5129 days ago
	"just ignore robots.txt?" how about "fuck you"? I guess it's high time to make honeypots, tarpits and bans common practice.

4 comments

sams99 5129 days ago

How about reading http://meta.stackoverflow.com/questions/132675/validating-th... before swearing at me.

No, we are not going to ban all the links from GitHub on our site cause shitty WEB CRAWLERS forced GitHub to use a white-list based approach. This is not WEB CRAWLING. It is link validating. We are not crawling in the sense of building a huge tree of links. We are testing that the external links on our sites work. If we are not allowed to test them, why are our users allowed to click them? Are we not committing an even greater crime by allowing these links on our web site?

"The Robot Exclusion Standard, also known as the Robots Exclusion Protocol or robots.txt protocol, is a convention to prevent cooperating web crawlers and other web robots from accessing all or part of a website which is otherwise publicly viewable."

The convention is a best effort thing, we tried to respect it, but doing so was both AGAINST what the authors of the Robots.txt file at GitHub intended AND the spec is advisory, not an IETF RFC. If it was an RFC then some smart people would review it and turn it into something sane and usable that deals with this exact use case.

And you know what, an RFC would NEVER pass for robots.txt as it is now cause the white-listing potential is anti Internet. Why should Google and Bing be the only parties who are allowed to discover content on the Internet? User agent restrictions are completely evil, wrong and backwards.

Sorry to shatter your imaginary delusion of what you think the Internet is.

fnulp 5129 days ago

"before swearing at me."

Followed by "shitty" and "WEB CRAWLERS" in all caps ^^ Someone ate a clown for breakfast, I see.

"No, we are not going to ban all the links from GitHub"

What? Slow down there -- why would you care about invalid links? Did you just say that you can't possibly allow users to post links, as long as you don't know they work for automatic crawlers, not just for human visitors? And someone else chimed in saying you give arguments? Heh.

Well, you give an attempt at one, with "we are not crawling in the sense of", and then refute it with the bit you quoted: "web crawlers and other web robots". It's not called webscraper.txt, it's robots.txt, period.

So how then would a website determine a rogue user agent? You dress up like the slimy guys, you get the banhammer -- what do you expect? If you care so much about Facebook and Twitter "content" that it is worth it for you to be indistinguishable from attackers, then just cope with it. But don't pout at me, just eat up what you ordered.

And what delusions about the internet? You just beat around the bush and then finish with that strawman? And what is an "imaginary delusion", by the way? The one you imagine I have? Now that's a Freudian slip if I ever saw one ^^

"Completely evil, wrong and backwards"... so... You're entitled to know the validity of links posted on your site, but website owners aren't allowed to care about their resources and who they offer them to? Who's deluded?

scott_s 5129 days ago

Slow down there -- why would you care about invalid links?

Because Stack Overflow is a site whose purpose is to answer questions. People may provide links when asking or answering a question, and those links may be important in understanding either the question or the answer. So invalid links degrade the value of the site.

What they're doing is fundamentally different from web crawling. Web crawlers are about discovering content. That means starting at a root and crawling out to see what you can find, so one URL can spawn many more URLs to look at. Stack Overflow, by contrast, starts with a known URL and checks whether that URL can be visited. They have one URL, and only visit one URL.
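To make that distinction concrete, here is a minimal sketch of a link validator in Python. It is not Stack Exchange's actual code. The names `fetch_status` and `validate_links` and the `example-link-checker/0.1` user agent are made up for illustration. The point is only that each known URL gets one request and nothing found in the response is ever followed.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Issue a single HEAD request and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        method="HEAD",
        headers={"User-Agent": "example-link-checker/0.1"},  # hypothetical name
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as exceptions; the code is still useful.
        return err.code


def validate_links(
    urls: Iterable[str],
    fetch: Callable[[str], int] = fetch_status,
) -> Dict[str, bool]:
    """Check each known URL once. No page body is parsed and no new URLs
    are discovered, which is what separates this from a crawler."""
    results: Dict[str, bool] = {}
    for url in urls:
        try:
            status = fetch(url)
            results[url] = 200 <= status < 400
        except OSError:
            # DNS failures, refused connections and timeouts count as broken.
            results[url] = False
    return results
```

A crawler would differ in exactly one place: it would parse each response for links and add them to the list of URLs to visit.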

Aissen 5129 days ago

He explains how and why in the article, and gives arguments. You do none.

The problem with whitelist-only robots.txt files is that they favor monopolies, and startups are the ones getting the "fuck you". But maybe you don't care about that.

tomjen3 5128 days ago

As a webmaster, why would I want bots that don't bring any (or much) traffic to visit my site?

bobbo3 5129 days ago

Why would your users post the addresses of your honeypots and tarpits to a 3rd party website?

fnulp 5129 days ago

What? I simply meant that if some brainiacs think robots.txt can just be disregarded, it's time to make it a minimum requirement of every self-respecting webmaster to make a tarpit (disallowed in robots.txt) and ban any and all bots going there. You would exactly NOT want a human visitor to post, or ever see, such a link. So yeah, it wouldn't even apply to this github thing, but don't tell that other guy about it.
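A rough sketch of that tarpit idea in Python, for anyone who wants to see the mechanics. Everything here is hypothetical: the `HoneypotBanList` class, the `/trap/` path and the status codes are illustrative choices, not a description of any real site's setup.

```python
class HoneypotBanList:
    """Ban any client that requests a path robots.txt told it to avoid."""

    def __init__(self, trap_prefix: str = "/trap/") -> None:
        self.trap_prefix = trap_prefix  # hypothetical path, never linked for humans
        self.banned = set()

    def robots_txt(self) -> str:
        """Return a robots.txt body that disallows the trap path for all bots."""
        return f"User-agent: *\nDisallow: {self.trap_prefix}\n"

    def handle(self, client_ip: str, path: str) -> int:
        """Return the HTTP status code to send for this request."""
        if client_ip in self.banned:
            return 403
        if path.startswith(self.trap_prefix):
            # Only a bot that ignored robots.txt should ever end up here.
            self.banned.add(client_ip)
            return 403
        return 200
```

Since the trap path is never posted anywhere a human would see it, a validator that only checks user-submitted links would not trip it.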

These are supposedly good guys. So my reaction was "You gotta be fucking kidding?! You didn't just say that it's inconvenient how some sites use robots.txt, so you just throw it out altogether for your precious little bot and epically important link checking quest. No wait, you did. Oh well then, BYE."

Oh well. I guess this is hack news, not hacker news, my bad :P

rsofaer 5128 days ago

The sort of tarpit you're talking about wouldn't even affect this link validator. You really think Stack Exchange should have given up on validating links because Github's robots.txt has:

User-agent: *

Disallow: /

in it?
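For reference, this is how Python's standard `urllib.robotparser` treats those two lines. The `ExampleSearchBot` entry and the `example.com` URL are assumptions added only to show what a whitelist looks like; they are not copied from Github's actual file.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical whitelisted agent, followed by the two lines quoted above.
ROBOTS_TXT = """\
User-agent: ExampleSearchBot
Allow: /

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The named agent may fetch anything.
whitelisted_ok = parser.can_fetch("ExampleSearchBot", "https://example.com/some/page")

# Every other agent, including a link validator that only wants to check
# one known URL, falls through to the catch-all rule and is refused.
validator_ok = parser.can_fetch("example-link-checker", "https://example.com/some/page")
```

Under this file a cooperating validator cannot check even a single link, whatever its intent.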

lazugod 5128 days ago

They could ask for Github's permission.

JasonPunyon 5128 days ago

Yeah, we did. http://meta.stackoverflow.com/a/132677/6212

grecy 5129 days ago

Would link validation be OK if I manually went through and clicked every single link by hand, and used a pen-and-paper tally of which ones worked and which ones didn't?

What's the difference?
