by Falkon1313 330 days ago
This is kinda amusing.

robots.txt's main purpose back in the day was curtailing penalties in the search engines when you got stuck maintaining a badly-built dynamic site that had tons of dynamic links and effectively got penalized for duplicate content. It was basically a way of saying "Hey search engines, these are the canonical URLs, ignore all the other ones with query parameters or whatever that give almost the same result."

It could also help keep 'nice' crawlers from getting stuck crawling an infinite number of pages on those sites.
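As a rough illustration of how a 'nice' crawler consults these rules, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the `ExampleBot` name, and the example.com URLs are made up for illustration; the standard-library parser does plain prefix matching, so the rule below covers every query-string variant of `/search`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a dynamic site: keep crawlers out of the
# endlessly varying search and calendar URLs, allow everything else.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /calendar/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A well-behaved crawler asks before fetching each URL.
print(parser.can_fetch("ExampleBot", "https://example.com/articles/1"))
print(parser.can_fetch("ExampleBot", "https://example.com/search?q=x&page=2"))
```

Nothing here is enforced: the check runs inside the crawler, which is why it only helps with crawlers that choose to run it.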

Of course it never did anything for the 'bad' crawlers that would hammer your site! (And there were a lot of them, even back then.) That's what IP bans and such were for. You certainly wouldn't base it on something like User-Agent, which the user agent itself controlled! And you wouldn't expect the bad bots to play nicely just because you asked them.

That's about as naive as the Do-Not-Track header, which was basically kindly asking companies whose entire business is tracking people to just not do that thing that they got paid for.

Or the Evil Bit proposal, to suggest that malware should identify itself in the headers. "The Request for Comments recommended that the last remaining unused bit, the "Reserved Bit" in the IPv4 packet header, be used to indicate whether a packet had been sent with malicious intent, thus making computer security engineering an easy problem – simply ignore any messages with the evil bit set and trust the rest."
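For the curious, the check the proposal asks for fits in a few lines. This is only a sketch: it assumes a raw IPv4 header in a bytes object, and relies on the reserved bit being the high-order bit of the 16-bit flags/fragment-offset field at byte offset 6. The `make_header` helper is hypothetical and builds a mostly zeroed header just for the demonstration.

```python
import struct

# The reserved bit is the top bit of the 16-bit flags/fragment-offset field.
EVIL_BIT_MASK = 0x8000


def has_evil_bit(ipv4_header: bytes) -> bool:
    """Return True if the reserved ("evil") bit is set in an IPv4 header."""
    if len(ipv4_header) < 20:
        raise ValueError("an IPv4 header is at least 20 bytes long")
    # Bytes 6-7 hold the 3 flag bits followed by the 13-bit fragment offset.
    (flags_and_offset,) = struct.unpack_from("!H", ipv4_header, 6)
    return bool(flags_and_offset & EVIL_BIT_MASK)


def make_header(evil: bool) -> bytes:
    """Hypothetical helper: a zeroed 20-byte header with version 4, IHL 5."""
    header = bytearray(20)
    header[0] = 0x45
    if evil:
        header[6] |= 0x80  # set the top bit of the flags byte
    return bytes(header)


print(has_evil_bit(make_header(evil=True)))
print(has_evil_bit(make_header(evil=False)))
```

The sender sets the bit, so the receiver learns only what the sender chooses to admit.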

7 comments

It should be noted here that the Evil Bit proposal was an April Fools RFC https://datatracker.ietf.org/doc/html/rfc3514
While we're at it, it should be noted that Do Not Track was not, apparently, a joke.

It's the same as a noreply email: if you can get away with sticking your fingers in your ears and humming when someone is telling you something you don't want to hear, and you have a computer to hide behind, then it's all good.

There should be a law against displaying a cookie consent box to a user who has their Do Not Track header set.
Not all that far-fetched, Global Privacy Control is legally binding in California.

https://en.wikipedia.org/wiki/Global_Privacy_Control

https://news.ycombinator.com/item?id=43377867

How is "Do Not Track" a joke, but a website presenting a "Do not use cookies" button is not? What's the difference?
It is ridiculous, but it is what you get when you have conflicting interests and broken legislation. The rule is that tracking has to be opt-in, so websites do it the way they are more likely to get people to opt in, and it is a cookie banner before you access the content.

Do-not-track is opt-out, not opt-in, and in fact, it is not opt-anything since browsers started to set it to "1" by default without asking. There is no law forcing advertisers to honor that.

I guess it could work the other way: if you set do-not-track to 0 (meaning "do-track"), which no browser does by default, make cookies auto-accept and do not show the banner. But then the law says that it should require no more actions to refuse consent than to consent (to counter those ridiculous "accept or uncheck 100 boxes" popups), so it would mean they would also have to honor do-not-track=1, which they don't want to.
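The "other way" described above can be sketched on the server side. This is a hypothetical policy, not what any law requires or what sites actually do: it reads the `DNT` request header, treats "1" as a refusal, "0" as consent, and only shows a banner when the header is absent or unrecognized. The function names are invented for this example.

```python
from typing import Mapping, Optional


def dnt_opt_out(headers: Mapping[str, str]) -> Optional[bool]:
    """True if DNT is "1" (do not track), False if "0", None if unset or unknown."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    lowered = {name.lower(): value.strip() for name, value in headers.items()}
    value = lowered.get("dnt")
    if value == "1":
        return True
    if value == "0":
        return False
    return None


def should_show_consent_banner(headers: Mapping[str, str]) -> bool:
    """Hypothetical policy: ask only when the browser expressed no preference."""
    return dnt_opt_out(headers) is None


def may_set_tracking_cookies(headers: Mapping[str, str]) -> bool:
    """Hypothetical policy: track only on an explicit DNT value of "0"."""
    return dnt_opt_out(headers) is False


print(should_show_consent_banner({"DNT": "1"}))
print(may_set_tracking_cookies({"DNT": "1"}))
print(may_set_tracking_cookies({"dnt": "0"}))
```

The symmetry is the sticking point raised above: a site that skips the banner on "0" has little excuse for ignoring "1".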

I don't know how legislation could be unbroken. Users don't want ads, don't want tracking, they just want the service they ask for and don't want to pay for it. Service providers want exactly the opposite. Also people need services and services need users. There is no solution that will satisfy everyone.

Labor laws are not set to satisfy everyone, they are set such that a company cannot use its outsized power to exploit its workers, and that workers have a fair chance at negotiating a fair deal, despite holding less power.

Similarly, consumer protection laws (which is what the cookie banners are) are not set to satisfy everyone, they are set such that companies cannot use their outsized power to exploit their customers. A good consumer protection law will simply ban harmful behavior regardless of whether the companies which engage in said harmful behavior are satisfied with that ban or not. A good consumer protection law will satisfy the user (or rather the general public), but it may not satisfy the companies.

Good consumer protection laws are things like disclosure requirements or anti-tying rules that address information asymmetries or enable rather than restrict customer choice.

Bad consumer protection laws try to pretend that trade offs don't exist. You don't want to see ads, that's fine, but now you either need to self-host that thing or pay someone else money to do it because they're no longer getting money from ads.

There is no point in having an opt in for tracking. If the user can be deprived of something for not opting in (i.e. you can't use the service) then it's useless, and if they can't then the number of people who would purposely opt in is entirely negligible and you ought to stop beating around the bush and do a tracking ban. But don't pretend that's not going to mean less "free stuff".

The problem is legislators are self-serving. They want to be seen doing something without actually forcing the trade off that would annihilate all of these companies, so instead they implement something compromised to claim they've done something even though they haven't actually done any good. Hence obnoxious cookie banners.

> since browsers started to set it to "1" by default without asking

IIRC IE10 did that, to much outcry because it upended the whole DNT being an explicit choice; no other browser (including Edge) set it as a default.

There have been thoughts about using DNT (the technical communication mechanism about consent/objection) in conjunction with GDPR (the legal framework to enforce consent/objection compliance).

https://www.w3.org/blog/2018/do-not-track-and-the-gdpr/

The GDPR explicitly mentions objection via technical means:

> In the context of the use of information society services, and notwithstanding Directive 2002/58/EC, the data subject may exercise his or her right to object by automated means using technical specifications.

https://law.stackexchange.com/a/90002

People like to debate as to whether DNT itself has enough meaning:

> Due to the confusion about this header's meaning, it has effectively failed.

https://law.stackexchange.com/a/90004

I myself consider DNT as what it means at face value: I do not want to be tracked, by anyone, ever. I don't know what's "confusing" about that.

The only ones that are "confused" are the ones it would be detrimental to, i.e. the ones that perform and extract value from the tracking, and make people run in circles with contrived explanations.

It would be perfectly trivial for a browser to pop up a permission request per website like there is for webcams or microphone or notifications, and show no popup should I elect to blanket deny through global setting.

For one, Do Not Track is on the client side and you just hope and pray that the server honors it, whereas cookie consent modals are something built by and placed in the website.

I think you can reasonably assume that if a website went through the trouble of making such a modal (for legal compliance reasons), the functionality works (also for legal compliance reasons). And, you as the client can verify whether it works, and can choose not to store them regardless.

> And, you as the client can verify whether it works

How do you do that? Cookies are typically opaque (encrypted or hashed) bags of bits.

Just the presence or absence of the cookie.
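A presence check like that can be sketched with Python's standard `http.cookies` module. The `Set-Cookie` values and cookie names below are invented for illustration; in practice they would come from the browser's developer tools or from a recorded response after clicking "decline".

```python
from http.cookies import SimpleCookie
from typing import Iterable, Set


def cookie_names(set_cookie_headers: Iterable[str]) -> Set[str]:
    """Collect the names of all cookies a response tried to set."""
    names: Set[str] = set()
    for header in set_cookie_headers:
        jar = SimpleCookie()
        jar.load(header)  # parses "name=value; Attr=..." into morsels
        names.update(jar.keys())
    return names


# Hypothetical Set-Cookie headers seen before and after declining consent.
before_decline = [
    "session=xyz; Path=/",
    "_track_id=abc123; Path=/; Max-Age=31536000",
]
after_decline = ["session=xyz; Path=/"]

# Cookies that disappeared once consent was declined.
print(cookie_names(before_decline) - cookie_names(after_decline))
```

This only shows which cookies are set. As the question above points out, it says nothing about what an opaque cookie value encodes.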
The goal with Do Not Track was legal (get governments to recognize it as the user declining consent for tracking and forbidding additional pop-ups) and not technological.

Unfortunately, the legal part of it failed, even in the EU.

Do Not Track had a chance to get into law; had that happened, it would have been good that the code and standard were already in place.
I like the 128 bit strength indicator for how "evil" something is.
So it did the same work that a sitemap does? Interesting.

Or maybe more like the opposite: robots.txt told bots what not to touch, while sitemaps point them to what should be indexed. I didn’t realize its original purpose was to manage duplicate content penalties though. That adds a lot of historical context to how we think about SEO controls today.
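The two mechanisms can sit in the same file, which makes the contrast easy to see. A small sketch, assuming Python 3.8+ (for `site_maps()`) and a made-up robots.txt: `Disallow` lists what to stay away from, while the `Sitemap` line points at a list of what the site would like indexed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt combining an exclusion rule and a sitemap pointer.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "Don't touch this": the exclusion side.
print(parser.can_fetch("ExampleBot", "https://example.com/cgi-bin/vote"))

# "Please index these": the sitemap URLs declared in the file.
print(parser.site_maps())
```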

> I didn’t realize its original purpose was to manage duplicate content penalties though.

That wasn’t its original purpose. It’s true that you didn’t want crawlers to read duplicate content, but it wasn’t because search engines penalised you for it – WWW search engines had only just been invented and they didn’t penalise duplicate content. It was mostly about stopping crawlers from unnecessarily consuming server resources. This is what the original specification from 1994 says:

> In 1993 and 1994 there have been occasions where robots have visited WWW servers where they weren't welcome for various reasons. Sometimes these reasons were robot specific, e.g. certain robots swamped servers with rapid-fire requests, or retrieved the same files repeatedly. In other situations robots traversed parts of WWW servers that weren't suitable, e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting).

https://www.robotstxt.org/orig.html

> It was mostly about stopping crawlers from unnecessarily consuming server resources.

Very much so.

Computation was still expensive, and http servers were bad at running cgi scripts (particularly compared to the streamlined amazing things they can be today).

SEO considerations came way way later.

They were also used, and still are, by sites that have good reasons to not want results in search engines. Lots of court files and transcripts, for instance, are hidden behind robots.txt.

> Computation was still expensive

I think this is still relevant today in cases where there are not many resources available: think free tiers, smallest fixed cost/fixed allocation scenarios, etc.

> And you wouldn't expect the bad bots to play nicely just because you asked them.

Well, yes, the point is to tell the bots what you've decided to consider "bad" and will ban them for. So that they can avoid doing that.

Which of course only works to the degree that they're basically honest about who they are or at least incompetent at disguising themselves.
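One way to read "tell them what you will ban them for" is to treat robots.txt as the published rule and the access log as the evidence. The sketch below is an assumption-laden illustration, not a recommended defense: the `(ip, user_agent, path)` log tuples, the addresses, and the `offending_ips` helper are hypothetical, and as noted above it only catches clients that do not disguise themselves or rotate addresses.

```python
from typing import Iterable, Set, Tuple
from urllib.robotparser import RobotFileParser

# Hypothetical published rules.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# Hypothetical access-log entries: (client IP, User-Agent, requested path).
LogEntry = Tuple[str, str, str]


def offending_ips(robots_txt: str, log: Iterable[LogEntry]) -> Set[str]:
    """Return the IPs that requested a path robots.txt asked them to avoid."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    offenders: Set[str] = set()
    for ip, user_agent, path in log:
        if not parser.can_fetch(user_agent, path):
            offenders.add(ip)
    return offenders


log = [
    ("203.0.113.5", "GoodBot/1.0", "/index.html"),
    ("198.51.100.7", "Mozilla/5.0", "/private/report"),
]
print(offending_ips(ROBOTS_TXT, log))
```

A real setup would also have to exempt human visitors who follow ordinary links into those paths; the sketch ignores that.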

I think it depends on the definition of bad.

I always consider "good" a bot that doesn't disguise itself and follows the robots.txt rules. I may not consider good the final intent of the bot or the company behind it, but the crawler behaviour is fundamentally good.

Especially considering the fact that it is super easy to disguise a crawler and not follow the robots conventions

Well, you as the person running a website can define unilaterally what you consider good and bad. You may want bots to crawl everything, nothing or (most likely) something in between. Then you judge bots based on those guidelines. You know, like a solicitor who rings a doorbell that has a sign above it saying "No solicitors": certain assumptions can be made about those who ignore it.
> That's about as naive as the Do-Not-Track header, which was basically kindly asking companies whose entire business is tracking people to just not do that thing that they got paid for.

It's usually a bad default to assume incompetence on the part of others, especially when many experienced and knowledgeable people have to be involved to make a thing happen.

The idea behind the DNT header was to back it up with legislation-- and sure you can't catch and prosecute all tracking, but there are limitations on the scale of criminal move fast and break things before someone rats you out. :P

Some people just believe that because someone says so, everyone will nicely obey and follow the rules. I don't know, maybe it is a cultural thing.
Or a positive belief in human nature.

I admit I'm one of those people. After decades where I should perhaps be a bit more cynical, from time to time I am still shocked or saddened when I see people do things that benefit themselves over others.

But I kinda like having this attitude and expectation. Makes me feel healthier.

I deeply agree with you, and I'd like to add:

Trust by default, also by default, never ignoring suspicious signals.

Trust is not being naïve, I find the confusion of both very worrying.

You don't have to go as far as to straight up "trust by default". You can instead "give a chance" by default, which is the middle path.

Actually, Veritasium has a great video about this. It's been shown to be the most effective strategy in Monte Carlo simulations.

EDIT: This one: https://youtu.be/mScpHTIi-kM

I like that Veritasium vid a lot, I've watched it a couple times. The thing is, there's no way to retaliate against a crawler ignoring robots.txt. IP bans don't work, user agent bans don't work, there's no human to shame on social media either. If there's no way to retaliate or provide some kind of meaningful negative feedback then the whole thing breaks down. Back to the Veritasium video, if a crawler defects they reap the reward but there's no way for the content provider to defect so the crawler defects 100% of the time and gets 100% of the defection points. I can't remember when I first read the spec for robots.txt but I do remember finding it strange that it was a "pretty please" request against a crawler that has a financial incentive to crawl as much as it can. Why even go through the effort to type it out?

EDIT: i thought about it for a min, i think in the olden days a crawler crawling every path through a website could yield an inferior search index. So robots.txt gave search engines a hint on what content was valuable to index. The content provider gained because their SEO was better (and cpu util. lower) and the search engine gained because their index was better. So there was an advantage to cooperation then but with crawlers feeding LLMs that isn't the case.

No, robots.txt can't fix this.

Have you tried Anubis? It was all over the internet a few months ago. I wonder if it actually works well. https://github.com/TecharoHQ/anubis

> Trust by default, also by default, never ignoring suspicious signals.

While I absolutely love the intent of this idea, it quickly falls apart when you're dealing with systems where you only get the signals after you've already lost everything of value.

It's easy to believe, though, and most of us do it every day. For example, our commute to work is marked by the trust that other drivers will cooperate, following the rules, so that we all get to where we are going.

There are varying degrees of this through our lives, where the trust lies not in the fact that people will just follow the rules because they are rules, but because the rules set expectations, allowing everyone to (more or less) know what's going on and decide accordingly. This also makes it easier to single out the people who do not think the rules apply to them so we can avoid trusting them (and, probably, avoid them in general).

In Southern Europe, and countries with similar cultures, we don't obey rules because someone says so, we obey them when we see that it is actually reasonable to do so, hence my remark regarding culture as I also experienced living in countries where everyone mostly blindly follow the rules, even if they happen to be nonsense.

Naturally I am talking about cultures where that decision has not been taken away from their citizens.

> I also experienced living in countries where everyone mostly blindly follow the rules, even if they happen to be nonsense.

The problem with that is that most people are not educated enough to judge what makes sense and what doesn’t, and the less educated you are, the more likely you are to believe you know what makes sense when you’re actually wrong. These are exactly the people that should be following the rules blindly, until they actually put in the effort to learn why those rules exist.

I believe there is a difference between education and critical thinking. One may not have a certain level of education, but could exercise a great degree of critical thinking. I think that education can help you understand the context of the problem better. But there are also plenty of people who are not asking the right questions or not asking questions - period - who have lots of education behind them. Ironically, sometimes education is the path that leads to blind trust and lack of challenging the status quo.
> the less educated you are, the more likely you are to believe you know what makes sense

It actually frightens me how true this statement is.

To reinforce my initial position about how important the rules are for setting expectations, I usually use cyclists as an example. Many follow the proposed rules, understanding they are traffic, and right of way is not automagically granted based on the choice of vehicle, having more to do with direction and the flow of said traffic.

But there's always a bad apple, a cyclist who assumes themselves to be exempt from the rules and rides against the flow of traffic, then wonders why they got clipped because a right-turning driver wasn't expecting a vehicle to be coming from the direction traffic is not supposed to come from.

In the end, it's not really about what we drive or how we get around, but whether we are self-aware enough to understand that the rules apply to us, and collectively so. Setting the expectation of what each of our behaviors will be is precisely what creates the safety that comes with following them, and only the dummies seem to be the ones who think they are exempt.

As a Frenchman, being passed on the right by Italian drivers on the highway really makes me feel the superiority of Southern Europeans' judgment over my puny habit of blindly following rules. Or does it?

But yes, I do the same. I just do not come here to pretend this is virtue.

The rules in France are probably different but passing on the right is legal on Italian highways, in one circumstance: if one keeps driving in the right lane and somebody slower happens to be driving in the left lane. The rationale is that it normally happens when traffic is packed, so it's ok even if there is little traffic. Everybody keeps driving straight and there is no danger.

It's not legal if somebody is following the slower car on the left and steers to the right to pass. However some drivers stick to the left at a speed slower than the limit and if they don't yield what happens is that eventually they get passed on the right.

The two cases have different names. The normal pass is "sorpasso", the other one (passing by not steering) is "superamento", which is odd but they had to find a word for it.

Not sure if it is a virtue, but standing as a pedestrian in an empty street at 3 AM waiting for a traffic light to turn green doesn't make much sense either, it isn't as if a ghost car is coming out of nowhere.

It should be a matter of judgement and not following rules just because.

It makes sense, as it allows you to walk city streets safely on autopilot while thinking about other things.
> For example, our commute to work is marked by the trust that other drivers will cooperate, following the rules, so that we all get to where we are going.

That trust comes from the knowledge that it's likely that those drivers also don't want to crash, and would rather prefer to get where they're going.

I love the culturally specific implication that 'commute' == 'commute in the car' :)
I apologize for that. I try to mitigate my US-centricness in my comments as much as possible, understanding completely that I am speaking with a global audience, but I am definitely not perfect at it :D

I suppose the same goes if you take the tube, ride a bike, walk, etc? There are still rules in terms of behavior, flow of traffic (even foot traffic), etc, that help set a number of expectations so everyone can decide and behave accordingly. Happy to hear different thoughts on this!

Robots.txt was created long before Google and before people were thinking about SEO:

https://en.wikipedia.org/wiki/Robots.txt

The scenario I remember was that the underfunded math department had an underpowered server connected via a wide and short pipe to the overfunded CS department and webcrawler experiments would crash the math department's web site repeatedly.

With the advent of AI and the notion of actually going to a website as being quaint: each website should have a humans.txt such as https://www.netflix.com/humans.txt or https://www.google.com/humans.txt
I have not heard of humans.txt before. It is apparently used for acknowledgement and crediting the dev team who created the resource.
What everybody is missing is that AI inference (not training) is a route out of the enshittification economy. One reason why Cloudflare is harassing you all the time to click on traffic lights and motorcycles is to slam the door on some of the exit routes.
Yup. Robots.txt was a don't-swamp-me thing.
It is so interesting to track this technology's origin back to the source. It makes sense that it would come from a background of limited resources where things would break if you overwhelmed them. It didn't take much to do so.
I still see the value in robots.txt and DNT as a clear, standardised way of posting a "don't do this" sign that companies could be forced to respect through legal means.

The GDPR requires consent for tracking. DNT is a very clear "I do not consent" statement. It's a very widely known standard in the industry. It would therefore make sense that a court would eventually find companies not respecting it are in breach of the GDPR.

That was a theory at least...

Would robot traffic be considered tracking in light of GDPR standards? As far as I know there are no regulatory rules for enforcing robot behavior beyond robots.txt, which is more of an honor system.
DNT and GDPR were just an example. In a court case about tracking, DNT could be found to be a clear and explicit opt out. Similarly, in a case about excessive scraping or the use of scraped information, robots.txt could be used as a clear and explicit signal that the site operator does not want their pages harvested. It all but certainly gets rid of the "they put it on the public web so we assumed we can scrape it, we can't ask everyone for permission" argument. They can't claim it was "in good faith" if there's a widely-accepted standard for opting out.
Fair enough. It should be sufficient to say one way or the other.