by nostril 1003 days ago
Hi all. I would like your help with an ethics problem.

I run a site called The Nose, a safe haven for AI training data. It operates overseas in a region out of reach of DMCAs. (Past info: https://news.ycombinator.com/item?id=37512147)

This was necessary because I felt it was unacceptable for entire datasets to be forced offline by one lawyer.

The ethical problem is that I'm sympathetic with people who want to remove their content from AI training data.

I received an email from the Danish Rights Alliance about Books3: https://pastebin.com/6qw3yMWZ

They point out that this is illegal in Denmark and elsewhere, and threaten to ban thenose.cc from Denmark.

Obviously, the threats are meaningless. But I'm interested in your views on whether we should comply with the request by removing the specific titles they list.

I was thinking of saying "If you say 'please', I will remove the listed titles." There are 109 entries, so it wouldn't be too much hassle to just remove those from the tarball, and it would be amusing to force a lawyer to ask nicely.

For now, I asked for a complete list of the full filenames they want to be removed, along with proof that they represent the listed rightsholders.

I'm more interested in how you feel. It seems reasonable to let people opt out of training. We could formalize this process by setting up a way to do this. We could also just ignore takedown demands.

What do you think?

6 comments

> What do you think?

Suspect this will be unpopular, but..

Information wants (and deserves) to be free. People make sophisticated and convincing arguments for incentivizing creation; they resonate with me, but ultimately I just do not agree with them.

I think projects like yours are on the right side of history, but it will take a long while before we collectively agree.

Your ethical dilemma hinges on whether or not you agree with the above.

Could you elaborate a little more on how you came to this belief? I'm interested in the process of deciding whether to agree or disagree. A good way to get better at that is to get perspectives from thoughtful people.

People naturally share useful information and this collaboration is the basis of all human achievement and the mechanism by which human society evolves.

Putting artificial obstacles in the way of sharing useful information is an act against progress and society itself.

For my part, I'm not an absolutist in this (not all information), but I enthusiastically support zlib and libgen because keeping books and papers from those who can't afford them (half the people on the planet!) is, in my view, extremely antisocial.

My take: copyright as a general concept was and is vital to the well-being of a creative society. But:

(a) Copyright law is so badly thought-out that I don't feel bad about breaking it; and

(b) What's happening in ML is nothing less than the next stage in human intellectual evolution, after thousands of years of relative stasis. It will prove far more important than copyright in the long run, and if a choice is forced the path is clear.

I don't have much use for the Roko's Basilisk argument, but I'm loath to take any action that might either hold back progress in this field, or that might make it possible for the technology to be captured and owned by powerful commercial interests. It will be humans, and not machines, who curse us in the future for allowing archaic values and corrupt copyright laws to slow progress down... or for allowing Facebook and Microsoft to control it.

TL;DR: party on.

I offer some insight from Thomas Jefferson, as much of my own thinking on the topic over time has converged with him, and he is, in my opinion, the superior wordsmith of the two of us.

https://historynewsnetwork.org/article/172970

>Jefferson’s cleanest expression of his views on patents came in a weighty letter to Isaac McPherson (13 Aug. 1813) about Oliver Evans’ proposed elevator patent—a string of buckets fixed on a leather strap, for drawing up water. Is Evans’ machine his own, “his invention,” or do others have right of usage? Jefferson was concerned with the machine itself, not its usage. If one person, for instance, received a patent for a knife that points pens, another could not receive a patent for the same knife for pointing pencils.

>Jefferson begins by noting he has seen similar contraptions used by numerous others—“I have used this machine for sowing Benni seed also” and intends to have other bands of buckets in use for corn and wheat—and even notes that such an elevator was in use in Ancient Egypt. He sums, “There is nothing new in these elevators but being strung together on a strap of leather.” If Evans is to be credited with anything new, “it can only extend to the strap,” yet even the leather strap was used similarly by a certain Mr. Martin of Caroline County, Virginia. There is, Jefferson is clear, nothing original in Evans’ machine.

>Jefferson, however, had more to say: many believe that “inventors have a natural and exclusive right to their inventions,” which is “inheritable to their heirs.” Yet it “would be singular to admit a natural and even an hereditary right to inventors.”

>Why? “Whatever, fixed or movable, belongs to all men equally and in common, is the property for the moment of him who occupies it.” Yet when he relinquishes occupation, he relinquishes ownership. It would be strange to think that a person acquiring ownership of some property, thus, has a natural right to it. That would mean that no one has a right to the property after he perishes, and even more absurdly, that no one had a right to that property prior to him having acquired the land. “Stable ownership is the gift of social law,” and not of nature. The argument applies straightforwardly to ideas. Jefferson sums, “It would be curious then,” adds Jefferson, “if an idea, the fugitive fermentation of an individual brain, could, of natural right, be claimed in exclusive and stable property.” The argument for patenting ideas by appealing to nature is untenable.

>Jefferson still has more to say. The analogy has its flaws. Ideas are singular. If there is anything that nature has made “less susceptible than all others of exclusive property, it is the action of the thinking power called an idea.” Each person possesses exclusively any idea so long as it is unshared. Once shared, it belongs to everyone.

>Moreover, an idea shared is fully possessed by all who entertain it. “He who receives an idea from me, receives instruction himself without lessening mine; as he who lights his taper at mine, receives light without darkening me.” The same cannot be said for property shared. It is that power of an idea, to be shared without lessening its density, which makes it a special gift of nature for “the moral and mutual instruction of man.” He sums, “Inventions then cannot, in nature, be a subject of property.”

While I understand he is not looked upon quite as favorably by many nowadays, as to the sense previously quoted, I hold vehemently he has the incontrovertible right of it, and that that which we endure nowadays as being "Intellectual Property", and the framework of legalisms around it, is an aberrant perversion of the right order of things. As human beings, we are finite, transient creatures. In our conducting of business wherein we have provided to men (or people if you prefer) the benefit of intellectual property, we have also created non-people (legal fictions) that are nevertheless granted the benefit of holding said intellectual property. These fictions do not die as men do, and benefit greatly, and in ways that are detrimental to the transmission of hard-won experience between generations, and furthermore perpetuate the greatest inequality of all of our time: that in a period wherein the replication of information is free, we still bind others to ignorance so that some, if not through the virtuous action of innovating then through acts of business, may lay claim to the fruits of the innovator's virtue, holding it over a fire, or throwing it in a vault, and decreeing "Humanity, thou shalt not know til my tithe is satisfied."

In the short time we all have, deep down, I believe it is the right of the thing that all should be spread as far and wide as can be, that the seeds of ideas may find fertile soil in the minds of others in which to bloom, to bring about a richer harvest for all.

Apologies for the wall of text. You asked though.

This is exactly what I was hoping for. Thanks.

I wish there was some way for us to keep in touch. There are a few things I was hoping for some thoughts on, and most of the people here don't have emails in their profiles.

salaw4t@ 'at' gmail 'dot' com

'Least until I'm done fighting with my ISP over getting a static IP so my damned email server won't get ignored out of hand by everyone because I'm in a residential dynamic IP block.

Understand why they do it, but Gawd... so annoying.

Just do the right thing. Put yourself in the other party's shoes and consider how you'd feel about approaching this from their side.

Textbook publishers aren't improving their offerings with each iteration. They re-release the same shit with a different cover and charge schools (and taxpayers) a premium for this "service." In some cases, the content they republish was already paid for with taxpayer money. Their business model is exploitative on every level. Fuck them.

A fiction author puts effort into a work of art. They're not forcing sales or doing anything shady; they're just someone trying to make a living selling copies of their art. Respect that and don't play games with them, unless they can't be civil.

This is a ridiculous take.

Textbooks aren't publishing what "was already paid for with taxpayer money". By that same logic if I write a book that summarizes all the scientific research in a certain area, then I don't deserve copyright. That makes no sense.

Writing a textbook is no different than writing a piece of fiction. It takes actual work to do, and it's original content.

And if you don't want to buy the latest edition for your class, blame the professor. Most of them are too lazy to actually use older editions and save students hundreds of dollars.

This is an excellent point. Thanks. Do you mind if I quote your comment in our official policies?

It's interesting because libgen also provides most fiction titles, but everyone is rooting for them.

For example, one of the books they want taken offline is from 1954, republished in 2008. So in this case they operate closer to the textbook model than the author model.

Thanks; go for it :)

I can't speak to libgen's current fiction policy. Just be cognizant of the human element.

Your last point is good; I meant to add something about dead authors too. Fuck estates for that very reason. Lazy-ass kids should write their own damn novel.

One other question: Is there legal basis in Denmark for asking for proof that they control the copyright for the listed works? DMCAs operate on "good-faith belief, under penalty of perjury" but in international situations it becomes trickier.

If anyone knows of a Danish lawyer I could consult with, or someone versed in international affairs, please let me know. (Or if you care to contribute funding. Hosting costs around $140/mo right now, which isn't free, but paying for consultation is costlier.)

If you're worried about DMCA, you should comply immediately. If you're not obligated to comply with DMCA, why bother with such a question? If there's no process set by law, set your own.

By "overseas" and "outside the reach of DMCA", be careful how you draw the lines. Did you incorporate overseas? How are you separating yourself personally from your corporation? If you are based in a country that obligates you to follow DMCA, and if your corporation is nothing but paper and you're the only person involved, a judge might disregard the corporation as a mere way for you to escape your local jurisdictional obligations.

We're not worried. Our operation is anonymous, and as long as we don't slip up, we'll be fine. Though saying "don't slip up" is very "draw the rest of the owl"; it's most of the work: https://news.ycombinator.com/item?id=37346620

But we'd like to do the right thing ethically, which is hard to figure out.

Hypothetically, if you were going to set up a process for yourself outside of the law, what criteria would you use?

Setting the law aside, I'd use the books to train AI. I could use them to train myself, right? What's wrong with training a machine?

But if you're distributing the contents of these books, that's another story. You're pirating, not training AIs. It didn't end up well for the guys behind The Pirate Bay, unfortunately. They can find you. If they can't bust you for copyright infringement, they'll just make stuff up until they put you in jail. Especially if you offend their personalities.

Be careful!..

I don't think (read: about 99% sure) that DMCA safe harbor applies to someone serving a repository that they themselves have compiled so there's no sense in a rights holder using that process. They can ask with varying levels of niceness and/or sue.

Just an opinion: "I think this data set is valuable, and I want to keep it available for use in free countries. After some thought, I think the best solution is the one you propose: go ahead and ban me in Denmark."

Shorter version: "Your move, asshole."

But that's just me, and it's easy to talk big when it's not your neck on the line. So I reckon you should go with what you think; you're the one in the firing line if they figure out how to come after you.

Make those files torrents and I’ll make sure your data will transcend into the unenforceable infinities of the internet.

What is the nature of the data in question?

Is it related to people, businesses, facts, creative works?

How was this data created and how did you have access to it?

As far as I can tell, Books3 seems to be training data for language models. I'm not sure how it was created, but it contains a lot of books. I got it by torrenting The Pile after it was forced offline by the Danes.

They're asking to remove 109 books from the dataset, which I can do. But I'm not sure whether to. Once you set aside the question of law, it becomes a matter of ethics, and these questions aren't so easy.

I wouldn't set the law aside. Do philosophy over the ethics if you want, but only after you are sure to have the legal side covered, because this one can ruin your finances and your life in general.

Unless you're based and incorporated in Iran, Iraq or North Korea, your country has signed the Berne Convention and has implemented in law some level of copyright protection that almost certainly makes the distribution of those books illegal.

If you're not taking very careful technical and legal measures to remain anonymous, you can get in serious legal trouble for breaking the law.

What is the upside for you? Companies like Uber, Google, etc break the law all the time. But they profit billions from that and then pay millions in fines and lawyers. What's your game? Are you profiting enough to make sense - financially-wise - to break the law?

Last but not least, I wouldn't play with lawyers' personalities trying to make them "please" you. Respect them, otherwise, they'll do whatever they can to make you regret it. And believe me, they can do a lot against you. These people are evil. Don't cross their paths.

> What's your game? Are you profiting enough to make sense - financially-wise - to break the law?

Not at all. Hosting costs $130/mo, and I feel the sting each month. I'm not sure we'll even get enough donations to cover that, let alone have some kind of profit motive. But we wouldn't want to profit off the works anyway, or else we'd be no better than the corporations.

My game is to help people like you be able to train your own models. If I don't help you, who will? Companies will have the final say in what you're allowed to do on your own hardware, because they control the data. No data, no training.

The hard part is to balance this with doing the right thing. I'd like to figure out the right thing from first principles and by asking thoughtful people like you, rather than from fear of consequences.

As for consequences, we're being careful enough that it seems worth the risk. (You can read more about our precautions at https://news.ycombinator.com/item?id=37346620.) But I agree that staying out of jail is preferable to being in one.