| HN Mirror

by _ktx2 1434 days ago

One of the things I dislike the most about Internet Archive is their relatively open attitude of flaunting privacy on purpose: http://blog.archive.org/2017/04/17/robots-txt-meant-for-sear...

Do you know what system they replaced robots.txt with? Email, one that is filed as a DMCA request: https://medium.com/wednesday-genius/how-to-remove-your-websi... https://jonathanwthomas.net/how-to-get-your-website-out-of-t...

Sometimes, it's probably good to not push the envelope without trying to establish consensus in good faith first.

2 comments

rasz 1434 days ago

Isn't that a response to companies buying old unused domains, slapping a robots.txt on them, and thus killing the whole archive of that domain going back 20 years?

_ktx2 1434 days ago

Could be! However, making a direct attack on individual privacy should never have been an option. To make matters worse, the logic of, "We did this to government and military websites, so now we're going to roll it out everywhere" was quite broken for the time and remains so.

There are examples of how this works in a healthy way. Martin Manley is one scenario that comes to mind, where he overtly opted in to having an archive stored about him upon his death: https://martin-manley.eprci.com/

gojomo 1434 days ago

Neither 'flaunting privacy' nor 'direct attack on individual privacy' is a fair description of any of the Archive's web collection policies.

People who freely publish information, to the worldwide public, on the 'World Wide Web' should reasonably expect all sorts of entities to collect, save, analyze, & repurpose that info, unless they take specific steps to discourage such access & use.

The Archive's crawlers identify themselves, and collect things that are publicly linked, or specifically nominated-for-collection by library patrons or partners. Except in some focused specialized collection projects, they don't "log in" as any user, only visiting & collecting what's published freely to any anonymous person/organization/process.

For material needing more privacy, websites always have the option to block any and all unwanted visitors/crawlers with a wide variety of standard techniques, like requiring logins or simple challenges that automated crawlers won't pass.

And, as your linked articles report, the process for a later exclusion by request is pretty quick and simple. (The 2nd post concludes: "So, hats off to the Internet Archive for making the process smooth and relatively painless.") And, such exclusion does not require any sort of "DMCA request".

stubish 1433 days ago

This is victim blaming. In my jurisdiction, you retain copyright in any information you publish, even to the worldwide public. This means I can reasonably expect entities to collect, save, analyze and repurpose that info within reason, and without specific steps to discourage access & use. This is why there are laws such as 'fair use' and 'satire', because we wanted to extend what is considered reasonable use of public works. But redistributing copyrighted works without permission? Legally actionable, if you have the money and lawyers and access to the necessary courts. If this were software, such as free software license violations, people in this forum would be calling for the lawyers to nuke them from orbit.

Thankfully DMCA should make the removal process easier now, especially in situations where control over the domain has been lost or the content is hosted by a third party. Although last I saw there were still artificial barriers, such as needing to list every single individual page that needs to be taken down. But this is after the fact, after you discovered your reasonable expectations and privacy have been violated. And then you have to track down the other copies of your now-private and copyrighted information that IA illegally distributed, such as to a few libraries around the world with similar projects.

gojomo 1433 days ago

I'm talking about the unfair allegation of privacy violations, here.

Note that when the Archive shares crawled content with other libraries, those other libraries often have their own legal right to collect, preserve, and make available that data, a right even stronger than the Archive's rights via fair use, implied license, library privileges, and other grounds. For example, many of the Archive's partners in government libraries, archives, & educational institutions have a statutory right & mission to collect copies of everything 'published', including via the world-wide-web, in their sphere of national interest.

As to what some unstated jurisdiction might consider "within reason", I prefer to think they'll find reasonable what I find reasonable, namely the IA's crawling policies, unless & until some actual governing authority finds otherwise in a clearly applicable/legible decision.

See my root post (ggggggp): in a vital, evolutionary, true-law-made-on-the-ground civilization, what actually winds up as "within reason" depends on the real implementations & multi-decade demonstrations of how things can beneficially work, as much or more than any copyright loyalist's strict reading of older statutory laws.

stubish 1432 days ago

Crawling and archiving everything, including personal writings, has a chilling effect. It is the same situation people are seeing with social media, where the past remains to haunt the present and none of our future leaders are using it without a mask. It was most surprising to people when some libraries decided 'published' meant anything put on the WWW or posted to Usenet. It seemed a grasp for funding and a way to stay relevant in an age where information was moving out of published media and into opinions virtually scrawled on a toilet door. The stuff I needed to get removed from the Australian National Library's archive is exactly the sort of stuff that shouldn't be in there, directly against the statutory rights and mission, and the sort of thing that could be pointed to when you wanted to defund the project. Because some twit thought meaningful Australian published materials meant anything under a .au top-level domain, all the dross hoovered up by IA went in, including all the stuff since removed because it is in nobody's interest or is causing harm. And it was a pain in the arse.

account42 1433 days ago

You are arguing about copyright in a thread discussing accusations of privacy violations.

_ktx2 1433 days ago

There is an overlap between the two. Copyright can be used as a defense against folk who believe, "Everything on the internet not behind authentication is commons". Often these folks point to books, magazines, etc. in support of their argument, which is certainly bad faith, but that's why copyright arguments come up.

A reference to one such comment in this thread: https://news.ycombinator.com/item?id=32150193

stubish 1432 days ago

Copyright is a mechanism used to protect privacy in these situations. When you don't have copyright, you are stuck needing a court to protect your privacy. Copyright is also what you are required to prove in order to get stuff taken down by IA when the content is not obviously illegal or personally identifiable information (or at least it was when I last needed to deal with it).

daniel_reetz 1434 days ago

With respect, I fail to see how a public website is a privacy matter.

stubish 1433 days ago

Information on a public website is public until it is taken down or the information is changed. The Internet Archive removes an individual's control over when the information remains public. This is privacy. We might be caught naked, and we can't unsee what has been seen, but it is a basic human instinct to draw the curtains and contain further damage. Perfectly innocent individuals suffer because the IA rules are designed around edge cases where public figures try to hide misdeeds.

account42 1433 days ago

If you print a magazine you also don't get to recall all copies if you change your mind about something. Giving individuals this kind of control over others' ability to freely share information is dangerous because it is easily abused to hide information that is in the public's interest, and that is not an edge case at all. Making a decision to publish something on the public web is hardly analogous to being caught naked, even if you may come to regret either.

If anything, the IA should be more reluctant to remove information without a court decision.

bakugo 1433 days ago

> The Internet Archive removes an individual's control over when the information remains public.

And that's a good thing in the vast majority of cases. Unless we're talking about sensitive information that was published without the consent of the person in question, all public information should remain public forever.

stubish 1432 days ago

In my experience, it is the vast minority of cases. Most of the content of the IA is not in the public interest, now or in the future. It is crap. It is noise. It is the contents of the Internet at a point in time. Actual information is the wheat in the chaff, and why you need search engines to find it. We know this, because of the Usenet archives that are intermittently available. Almost completely useless apart from people having a giggle at how the Internet used to be, a quick browse and search for naughty words. And a few gems in the mountain of noise, in such dire need of curation people hardly know it exists and barely justifiable enough for libraries to keep it alive.

Kye 1433 days ago

Some people discover much too late that there are some things they wish they could take back. Often before trying to get a better job or when trying to escape an abuser. Given the ramping up of attacks (legal and otherwise) on queer people, this is going to be a huge issue over the next decade or so.

tornato7 1433 days ago

If you’re relying on an honor system .txt file to preserve your privacy I think that says enough already. It’s not like they’re infiltrating password-protected links or private iCloud accounts.
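
To illustrate the "honor system" point, here is a minimal sketch in Python of how a cooperating crawler would consult robots.txt before fetching a page. Nothing enforces this check; a crawler that skips it simply fetches the page anyway. The `may_fetch` helper, the sample rules, and the example.com URLs are hypothetical, and `ia_archiver` is used here only as an assumed example of a crawler user-agent token.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one crawler is asked to stay out entirely,
# everyone else is only asked to avoid /private/.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url.

    This is purely advisory: the check runs inside the crawler, so it only
    restrains crawlers that choose to run it.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ia_archiver", "https://example.com/page"),
        ("SomeOtherBot", "https://example.com/page"),
        ("SomeOtherBot", "https://example.com/private/x"),
    ]:
        print(agent, url, may_fetch(agent, url))
```

The server never sees this decision being made, which is why the file can express a preference but cannot by itself keep anything private.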

_ktx2 1433 days ago

You are right, regulation is what solves this for good.

gojomo 1433 days ago

In truth, regulation can only reliably protect your privacy from well-behaved actors whose actions/violations are observable.

If you've taken no self-help measures to limit access, then bad actors, unobservable to you and regulators, will still be doing whatever they would like to do and can get away with.

But you may be lulled into a false sense of security by the false promise of a 'solution' via regulations.

_ktx2 1433 days ago

As I've now learned, you used to work for the Internet Archive. You should probably start your statements with that.

> If you've taken no self-help measures to limit access...

robots.txt was a nice self-help measure.

> ... then bad actors, unobservable to you and regulators, will still be doing whatever they would like to do and can get away with.

Regulators still have to follow regulations. You are right that I can't stop someone from creating offline archives - but they're not really who I am worried about. Nor am I worried about the small servers that keep copies of documents during transmission, unless of course they're doing so for criminal reasons.

gojomo 1432 days ago

Should you start all of your statements with a list of every project you've ever worked on? Show me an example of how it's done before you make such an exceptional request of me.

For any who are more curious about a commenter's background than their current words, my profile already links to copious resources on my work history & writings, beyond what's typical of contributors here.

_ktx2 1432 days ago

Yes, if I were commenting on a project or company that I had a self-interest in (e.g. reputational, monetary, etc.) then I would add a disclaimer like everyone else here does.

If you're writing messaging systems and working on cryptocurrency you can figure out Algolia: https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...

You didn't need me to do that.
