by Iv 2113 days ago
Several years ago I discovered that, despite a similar philosophy behind law-writing and programming, these worlds are very far apart, and every year, even when I thought my cynicism had stabilized, the gap continues to widen.

This is from the American Bar Association; it is the opposite of a blog post you would end with IANAL. Yet we read (at the end) that even professional lawyers are "struggling" to understand why a given accepted practice is ok given the laws and obligations. And this on no small matter: it is about hiding that the plaintiff died before the trial. And it is not clear whether it is ok or not.

I had the "chance" to talk with lawyers specializing in intellectual property. They did not even understand their subject or the imprecision of the law. Every time I try to dig into legal issues surrounding IP, I end up with the impression that the difference between a lawyer and a layperson is that the lawyer just is more up to date with what was made up in court recently.

Lawyers are going to help you if your case has happened tens of times in the past, but they are going to be clueless in a genuinely new situation. Just as the judge will be.

Recently I read about the licensing issues around deep learning models and the definitions of fair use and derivative works. We are used to IETF standards and IEEE specs. Even RFCs are usually pretty precise in how they define things. Laws are crappy when held to that standard. They are just there to provide arguments in a mud-slinging negotiation.

5 comments

> the lawyer just is more up to date with what was made up in court recently.

...and not even that. I've been told several times by very senior lawyers to ask a senior technical person for advice on thorny software licensing issues. In one case that senior technical person pointed me to a recent decision that basically answered exactly the right question.

IME/IMO, lawyers are vastly over-estimated, vastly over-respected, and have far too much political power relative to other stations in this country. We need to reform our legal system from the bottom up so that it serves us instead of the guild.

> And this on no small matter: it is about hiding that the plaintiff died before the trial. And it is not clear whether it is ok or not.

I think the scenario in question had the complaining witness dying. Is a complaining witness the same as a plaintiff? I would think the plaintiff in this scenario is the state. Genuine doubt, as I'm not a lawyer.

They are the same thing, with possible procedural differences in some places (what they get called depends on the type of case): https://en.wikipedia.org/wiki/Plaintiff

Legally speaking they are different, but to the layperson they mean the same thing (both are the ones on the opposite side of the defendant).

If you have any more info about the licensing issues around deep learning models, I'd be very interested to read it.

In exchange, here is a link to the Debian Deep Learning Team's Machine Learning policy:

https://salsa.debian.org/deeplearning-team/ml-policy

Well, I'd be happy to have someone to chat with and exchange ideas about it. I am currently digging into that rabbit hole, which seems to be basically uncharted territory.

I would like to find a way to make true open source deep learning models.

The Debian legal mailing list [1] and LWN [2] have interesting takes on the relevance of the GPL. To them, putting a trained model under the GPL implies that you have to open your dataset too, since it is the "source". That seems to be more or less the consensus, but I still think it is debatable and could use clarification.

I also dug into the question of whether a trained model is actually copyrightable if the training code and the dataset are free. This is akin to a "compilation" operation that adds no creative input (anyway, applying copyright to source code is already a bit of a hack). There are pretty strong grounds to argue that trained models are similar to a "compilation of facts", which comes with very little protection.

I am now wondering if open source can actually work for deep learning: open source licenses need strong copyright protection to be enforceable, and trained models may not be copyrightable. Maybe a DL model is not protected enough for that.

Finally, I am reassured by recent fair use rulings that a model will probably not be considered a derivative work of its dataset and that proprietary data can legally be used to produce an unencumbered model, but the legal uncertainty still exists.

If you are interested in helping me figure out how to protect crucial models so that the first AGI will be beneficial to all and open sourced, I'd be very happy to have someone poke holes in my ideas.

[1] https://lists.debian.org/debian-legal/2009/05/msg00028.html
[2] https://lwn.net/Articles/760142/

The Debian ML policy linked above goes a fair way to making truly open source deep learning models. The biggest problem with the policy is that it does not address the economic disparity: only folks with a lot of money can train a model, even if they have all the training software, drivers, and source data under a free license. Perhaps Debian can get enough donated compute time that we can solve this, though.

The products of compilation seem to be copyrightable, otherwise software piracy wouldn't be prosecutable. Perhaps the same would apply to trained models.

Do you have a link to those fair use rulings? Also note that fair use is an American concept and doesn't apply in many countries, some of which have similar but more restricted concepts. Also, I wouldn't consider a model produced under your example as a free model, that would be more of a ToxicCandy model in the Debian ML Policy parlance.

Thanks! It really gave some good insights!

Thanks for the links below. Reading these opinions took two more hours of my time but helped me refine my thoughts!

First, a quick answer to your last two questions. Programs and binaries are widely recognized as copyrightable. What I am wondering is whether the action of compiling a program constitutes a contribution worthy of protection and of additional copyright. To give a concrete example, imagine I am a company that uses gcc and big machines to provide compilation as a service. You feed it BSD-licensed source code. My server returns a binary on which I claim a proprietary copyright. Are you allowed to dismiss it as being just the result of a totally deterministic and automated process and reclaim it as BSD? I would argue yes, but it could be a non-obvious court case.

Anyway, I don't think I agree on the comparison between compilation and training.

> Do you have a link to those fair use rulings?

I was thinking about this [1] ruling (Authors Guild, Inc. v. Google, Inc.), in which Google scanned commercial books and used this obviously non-free dataset to provide in-text search mechanisms. I am pretty bitter about the fact that one of the main reasons for the favorable outcome (Google won) was that the judge considered it to have an "obvious" usefulness when the ruling finally happened, some 10 years after the scanning started, at which point it certainly did not appear obvious to non-tech people. So Google had to prove a tech while in a legal gray zone, a luxury orgs like Debian may not have.

------------------

Now for the real meat :-)

> The Debian ML policy linked above goes a fair way to making truly open source deep learning models

Actually, I am wondering if they are not a bit blinded by the way the GPL works and if they don't constrain themselves a bit artificially with imaginary legal precedent.

They all seem to assume that a trained model will be recognized as a compiled binary, but I see at least 5 competing comparisons that were proposed and could hold ground legally:

1. Trained models as compiled binary
2. Compilation of facts as proposed here [2]. I find it pretty persuasive even if its author dismisses it for what I think is not a good argument.
3. Rendered 2D image from a 3D model
4. 2D photograph of a real 3D object
5. Training as a copyrightable creative creation [3]

It is understandable that Debian maintainers think about everything in terms of programs and source, but I feel they shoehorn that notion a bit into the case of machine learning and may not realize how much more flexible the legal framework actually is.

Admittedly, I am less interested in the consequences of slapping the GPL on a trained model than in finding a way to solve the potential problems caused by bad actors in the field, just as FOSS did for regular software. I strongly suspect we may have to write a viral license adapted to ML.

One of my examples: how would one go about preventing one's work from being used by OpenAI the day they decide to stop releasing their trained models? Or avoid helping Google or Facebook gain an even more dominant position by adding data to an already good model?

We benefit a lot from the fact that, right now, there seems to be genuine goodwill from wealthy actors to contribute to the research community, but it feels to me like a Mexican standoff. What happens when one decides to run off with what is published and secretly improves it for commercial gain?

I must say that I have been happily surprised by how much is free to use right now, from research and algorithms to frameworks and trained models. We avoided a lot of dystopias, probably thanks to some unsung hero researchers who imposed openness on their employers upon being hired.

The risk still exists though, as all this openness can be reversed on a whim. Basically, I am wondering how we can stack the odds in our favor so that the first AGI benefits humanity instead of its owner.

Sorry for the wall of text. If you are still there and would like to continue this discussion, here is fine, but real-time discussion is also fine: you can shoot me a mail at yves.quemener@gmail.com and we can do Hangout or Signal from there.

[1] https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....
[2] https://lists.debian.org/debian-devel/2018/07/msg00175.html
[3] https://lists.debian.org/debian-devel/2019/05/msg00380.html

One other thing is the economic aspect. I just saw this on HN today:

https://learning-at-home.github.io/
https://news.ycombinator.com/item?id=24370510

There is a lot of discussion about that aspect on the Debian mailing lists as well. They argue about whether they should have machines to redo the training of models that are considered open source, and whether that should be part of the "build" process.

I think it is also worth noting that we are going that way, but there are a lot of actors with a lot of processing power. Notice how, two years after one model breaks records, there are ways to make it run with 1000x less power. We are brute-forcing the problem, but I doubt that raw power is going to matter a lot in a few years.

Also, a cat detector is pretty usable at 99%; not everybody needs 99.99%.

More than processing power, the real power in distributed training lies in the variety of situations. A thousand users may have a hard time having more computing power than Facebook's TPU farm but it will be easier for them to have a larger dataset.

Also, another back and forth at DebConf 2012 that introduces the problem and presents some real world implications: http://penta.debconf.org/dc12_schedule/events/888.en.html

> the impression that the difference between a lawyer and a layperson is that the lawyer just is more up to date with what was made up in court recently.

In fact, to me that sounds like exactly what you would expect in a common law system.

While the difference in source material can be a wide gap, as you described, the bigger difference that people commonly point to is ethics.

As a lawyer there are rules that define standards of behaviors and violation thereof can quickly terminate not just employment but the career forever. Software doesn’t have that and the entire idea is utterly foreign. As a case in point all lawyers have a general understanding of the word ethic and how it applies to their profession. I have found, as a long time software developer, that most software developers have no idea what that word means and are quick to make faulty assumptions regarding its application. In the software developers’ defense there is not a lot of reason to accurately understand a thing that has never existed in the first place.

> As a lawyer there are rules that define standards of behaviors and violation thereof can quickly terminate not just employment but the career forever.

That may be true in theory, but in practice most lawyers will protect other lawyers, and their regulatory bodies do very little to discourage bad actors.

> As a lawyer there are rules that define standards of behaviors and violation thereof can quickly terminate not just employment but the career forever. Software doesn’t have that and the entire idea is utterly foreign.

This is mostly a function of the industry's age, and is not true of other fields of engineering. I'll be astounded if software isn't folded into Professional Engineering by mid-century.

That seems highly optimistic. Most software developers are vehemently opposed to the professionalization of their industry.

Of course they are. Most software developers wouldn't pass a PE exam. Professionalization of software engineering won't be done from the ground up. It will be imposed from the top down.