Hacker News
by kator 12 days ago
> Yet, this shift made me re-evaluate the open source code publishing. Prior to that, I have been positive about free and open software, and considered this to be the default mode for work such as kefir. I did not require any justifications from myself to publish something. Now, however, I feel more and more that the main beneficiaries of my unpaid work are companies scraping the internet to train large language models. Currently accepted status quo in this area goes against my own intentions in licensing this work under GNU GPLv3. Publication has ceased to be the "null hypothesis" for me, and requires explicit mental justification which I am not able to provide.

I feel this pain: one of my small donation-driven sites has been destroyed by crawlers that just ignore robots.txt and burn the site into the ground.

Sort of jokingly I proposed an update to the "spam fax" law:

https://www.karlbunch.com/random/website-protection-act/


This is essentially the digital world transforming from a high trust society into a low trust one. Sad to see.
I don't think the digital world was ever high trust. I mean, everyone above a certain age is trained to never click on the biggest download button on a page and to uncheck any checkboxes during an installation. A certain open source forge used to bundle malware in downloads. You can't walk two steps without hitting cloudflare. All email providers consider random VPS IP ranges to be spam farms. All web servers with public IPs must be up to date or you get pwned instantly and assimilated into a bot farm.

I can go on and on about how many safety measures we have taken online for ages and how little trust we have in anything that comes through an Ethernet port. I have never needed such levels of vigilance in real life, even though I live somewhere with a higher crime rate than the national average.

Not even just digital; much of the world is shifting from high trust to low trust as well: https://social.desa.un.org/sites/default/files/inline-files/...
There are currently a lot of people in the upper echelons of our society who repeatedly and vigorously abuse the high-trust digital world.

We based all of this on gentlemen's agreements and handshakes. That let quite a few people get only very wealthy, instead of hyper-wealthy. Thus those agreements have to be shredded.

AP mentions this in the link:

> Section 227(g)(4). Enforcement. Statutory damages of not less than $500 per server request made in violation of this section, consistent with the per-violation damages established under the original Act for unsolicited facsimile transmissions.

While this is at least something, it's not going to dissuade a startup from doing this sort of thing. They'll find ways to hide the origin of traffic, or just soak up the costs with more VC money.

You need to start throwing people in prison for long periods of time (10+ years) for this sort of thing to stick.

This is kind of crazy. The digital world has never been a "society" except perhaps for the first few years after ARPANET was invented, and it certainly hasn't been a high-trust one for almost as long - we've had spam filters, user account registration required to comment, various authentication methods, moderation, and various things you get in a low-trust environment for decades now. To think otherwise is a bit delusional.
To whom would you attribute the greater part of that reduction in trust: the people using FOSS to train LLMs, or the people trying to block them?
People who break the social contract are the ones responsible for breaking the social contract, not the ones who take steps in response to the social contract being broken.
So the questions here are (a) is any generally accepted social contract actually being broken, and (b) if so, who are the ones who are breaking it?
The contract behind open source was something like (GPL):

"If you copy my work, you should share your work too."

or at minimum (MIT):

"If you copy my work, you should credit me."

I think it is no longer under dispute that the legal contract is satisfied by LLMs. The AI companies won and will continue to win.

But we are talking about a social contract, which is not quite the same thing. The social contract is what leads some devs who previously enjoyed publishing their work openly to no longer feel the same way. What did the authors mean by "copy"? Did they mean literally CTRL+C, CTRL+V or something broader?

This is a matter of opinion which only each individual creator can answer. For me, copying meant something like:

"To reproduce the function of my work, dependent on my having published it, without effort or understanding of your own"

Ten years ago this basically required doing a CTRL+C, CTRL+V so there was no need to be more specific. Anybody who did enough work to, say, rewrite in another language (with that language's idioms), met the bar of clause 3. Now AI enables a form of "copying" that matches my definition, without the user even being aware of whose works they are copying. It perfectly launders the origins of its output. It can write an FFmpeg clone in Rust for you that would appear to be a novel work.

Of course, I cannot say that my own little bits and pieces of open source code would make a scratch in AI's capability, were it removed.

But I do strongly believe that if all the code that was published by authors with the same mindset was unavailable, Claude would be a far weaker developer.

> But we are talking about a social contract, which is not quite the same thing. The social contract is what leads some devs who previously enjoyed publishing their work openly to no longer feel the same way.

Perhaps this illustrates a fissure that was always lurking under the surface, then. The social contract that I've personally always attributed to FOSS communities was that attempting to restrict how people downstream of you use code is illegitimate, and that licenses like the GPL were meant to use copyright law to achieve something that resembles the state of affairs that might exist if copyright didn't exist in the first place. That's what the whole concept of "copyleft" always seemed to imply.

Now we have a new class of technologies that is admittedly fraught with a wide range of risks and pitfalls, but also a lot of promise to enable people to actually put the "four freedoms" into practice in ways they couldn't before, and we're seeing people who have normative opinions about AI derived from other, unrelated principles trying to circle the wagons and exclude those use cases. That is what seems like a breach of the social contract as I've always understood it.

> Did they mean literally CTRL+C, CTRL+V or something broader?

Given that FOSS licenses were always constructed to function within applicable copyright law, I don't see how they could mean anything else. "Literal CTRL+C, CTRL+V" is the only thing copyright has ever applied to, and the whole point of "copyleft" was to lessen the restrictions on even that.

> "If you copy my work, you should share your work too."

Not exactly. The GPL way is that you should share my work under the same terms if you want to share it, even if modifying it.

You are not required to share anything if you don't actually share anything, and just run it yourself. That's where all the criticism towards cloud providers who freely use FLOSS is directed.

> But we are talking about a social contract, which is not quite the same thing. The social contract is what leads some devs who previously enjoyed publishing their work openly to no longer feel the same way.

There is clearly a misalignment in expectations from some FLOSS enthusiasts. The main FLOSS licenses focus exclusively on distribution, but their expectations somehow extend well beyond distribution. We hear those FLOSS enthusiasts criticize and attack companies for using software exactly according to their terms, and somehow that is framed as abuse if said users happen to be bigger than some arbitrary boundary.

No one consented to training LLMs. As the OP clearly implies, if they had been asked they would have declined, as would all of the many copyright holders who are in the process of suing the model companies.
Are you asking how AI coding agents, the companies selling them and the individuals using them break the FOSS social contract (copyleft, attribution, upstreaming), or are you disputing that they do?
Both would resolve to the same question, no?

There seems to be an implicit premise here that any work generated by an LLM whose training data includes a particular bit of code itself constitutes a redistribution of that code. I've yet to encounter any strong arguments substantiating this premise as a general principle, and my own suspicion is that it is not valid as a general principle, given the nature of how LLMs operate.

It's certainly possible that specific instances of LLMs lazily copy-pasting code from public repos may exist, and the extent to which this is happening is something that can be substantiated by empirical examples, so if you have any to point to, I'd be interested in looking at them. However, where this is happening, it ought to be regarded as a failure modality of LLMs, and not something that implicates the underlying nature of LLMs, given that their intended purpose is to function as stochastic generators that do not merely copy-paste input data.

My initial feeling here is that using open-source code to train LLMs is not per se a violation of the generally accepted FOSS social contract, but rather that attempting to restrict specific use cases of FOSS-licensed code on the basis of normative opinions unrelated to the license terms is a violation, or at least a rejection, of that social contract. I'm not fully committed to this position, though, and would welcome well-reasoned arguments to the contrary.

Yes, and obviously: bots crushing servers in strict contravention of the robots.txt rules.
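
For anyone unfamiliar with what honoring robots.txt involves on the crawler side, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt content, the bot names (ExampleBot, OtherBot), and the example.org URLs are all made up for illustration; a real crawler would fetch the site's actual file before requesting anything else.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one bot is banned outright, everyone else
# is kept out of /private/ and asked to wait 10 seconds between requests.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
Crawl-delay: 10
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True only if the rules allow this agent to request this URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
```

A well-behaved crawler would call may_fetch before every request and sleep for parser.crawl_delay(user_agent) seconds between requests when a delay is set. Nothing enforces any of this, which is the complaint here: compliance is entirely voluntary.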
“No, no, what was she wearing?”
People who take steps in response to the social contract being broken are the ones responsible for the steps they've taken, not the ones who break the social contract.
It's definitely the ones DDOSing websites while giving no attribution in any way to the original creators.
DDOSing websites seems to be an unrelated problem, and one that has traditionally been solved through response throttling and IP blocking.
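
As a rough illustration of the response throttling mentioned above, here is a minimal per-IP token bucket sketch. The class name, the default capacity, and the refill rate are assumptions picked for the example, not anything from a particular server or library. The clock is passed in explicitly so the behavior is easy to reason about.

```python
class Throttle:
    """Per-client token bucket: each IP may burst up to `capacity`
    requests, then is limited to `refill_per_sec` requests per second."""

    def __init__(self, capacity: float = 5.0, refill_per_sec: float = 1.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        # Maps ip -> (tokens remaining, timestamp of last request)
        self.buckets: dict[str, tuple[float, float]] = {}

    def allow(self, ip: str, now: float) -> bool:
        """Return True if the request should be served, False if throttled."""
        tokens, last = self.buckets.get(ip, (self.capacity, now))
        # Refill in proportion to elapsed time, never above capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.refill_per_sec)
        if tokens >= 1.0:
            self.buckets[ip] = (tokens - 1.0, now)
            return True
        self.buckets[ip] = (tokens, now)
        return False
```

In a real server, `now` would come from something like time.monotonic(), and a False result would typically turn into an HTTP 429 response. A scheme like this only helps when the traffic comes from a manageable number of addresses; it does little against a crawler that spreads its requests across many IPs.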

Attribution is often required even on MIT or BSD licenses where code is being redistributed, either in original or modified versions, but that would relate to this discussion only to the extent that one regards using LLMs whose training data included a certain bit of code as itself constituting redistribution of that specific code -- but that in turn is a very debatable premise which really ought to be argued for, and not merely argued upon as though it is already generally recognized as true.

Why? You stole my stuff and now are pretending I need to argue for you to stop stealing it. It's a joke.
This is the very question under debate. Training LLMs on publicly available data is a novel situation, and neither law nor social opinion has reached a consensus on the subject.

Copyright maximalists like to borrow unearned moral weight for their position by conflating copyright infringement with "stealing", but this is not actually true in any legal sense. It's not clear that training an AI on publicly available data should even constitute copyright infringement, much less "stealing".

What? What is being "stolen" from you?

Are you now layering the old and tired "copyright infringement = stealing" argument on top of the still unsubstantiated premise that all LLM training is copyright infringement?

> The sender pays, not the receiver.

You have a hole here. Your web server is sending the response and the bot is receiving.

Fix that and … profit? :-)

I'm trying to compose a better wording, but my attempts aren't working. The best I've got is:

> The initiator of the communication pays, not the server operator.

Oh, good point, I got that backwards… OMG, my fax brain didn’t even think about it.
Really hate to say it, but I’ve stopped publishing my work too for this reason. I spend most of my time now building my own little software ark, and I aspire to no longer think of programming in the next few years. I feel like the creative economy in general will be unrecognizable in the near future, maybe nonexistent. I wonder what modes of collaboration on ideas might form in the next few years.
Here is what the purveyors of AI don't seem to realise. You can bend copyright law all you want in order to train your models on whatever you can grab, but in the absence of genuine protection of their creative work authors are simply not going to be publishing at all.
I think they see it all too well. They still think they can make bank today while it lasts, whatever comes after is some other shareholder's problem. And if we're talking about open source, killing it might be a positive side effect, they'll be ready to sell you a closed source alternative when you no longer have options.
I don't think we're going back to closed source. I think we're going back to guilds. Aka. closed knowledge.
Furthermore, if people not only stop publishing but also take down already published works, it will create a moat around already existing language models.

And the more they DDOS small websites — instead of respectfully scraping once — the more realistic my conspiracy theory looks.

People who are making stuff because they want to share it are still going to be publishing. And fighting to be noticed in an unending torrent of slop.
Without any material or immaterial benefits? And with one's work being ground up and turned into weights for the next version of the machine that's threatening one's employment?
I personally am sharing stuff because I want people to read my comics, and maybe join my crowdfunding campaigns.

If I could put everyone pushing all this AI crap into a meat grinder, I would.

> People who are making stuff because they want to share it are still going to be publishing.

Those people who do that are too few and far between to make a difference. The majority of open source devs aren't giving away the source without a license. That license is how they specify what they want in return.

> The majority of open source devs aren't giving away the source without a license.

100% of open source devs aren’t giving away the source without a license, since a license (the grant of permissions for what is otherwise exclusive to the author under the law) is what makes something open source.

> That license is how they specify what they want in return.

No, the license is how they legally give away permission to use material that is legally subject to their exclusive rights by virtue of creation. The license may be a contract license that, as you suggest, involves mutual exchange of value, but for many (especially permissive) open source licenses it is a gratuitous bounded grant of permission which has limits but does not involve giving something of value back to the creator.

> No, the license is how they legally give away permission to use material that is legally subject to their exclusive rights by virtue of creation. The license may be a contract license that, as you suggest, involves mutual exchange of value, but for many (especially permissive) open source licenses it is a gratuitous bounded grant of permission which has limits but does not involve giving something of value back to the creator.

Wrong. What they want in return is either credit or derivatives of the software. It's disingenuous to suggest that all these authors specifying, in a legal document, the exact mechanism by which to pay them back don't know what they are asking.

If you're not happy with that trade, then don't make it.

Great. More work for AI then.
The sad thing is I feel trapped on all sides of the debate. I wrote a book about LLMs and human creativity (spoiler: humans win for a long time). I was going to do it as a blog series, but instead I published https://www.amazon.com/dp/B0GXCSY4W8 because I felt I might at least get a bit back for the literally hundreds of hours of my life I poured into the book, and for my editor and the friends who read it and provided reviews.

And I push a lot of open source code, including a ton for the SWGEmu project, but now I’m of two minds about whether to stop pushing anything public. I can’t decide whether I’m talking out of both sides of my mouth. It’s a confusing time to navigate, for sure.

Indeed sad, congrats on publishing your book though. I’ve certainly felt a bit of that same angst myself.

I think SWGEmu (cool project, just learned of it from you!) does represent some optimism though. Maybe these sorts of passion projects will take over the space?

> Really hate to say it, but I’ve stopped publishing my work too for this reason.

Me too; not that I've published a lot, but definitely more than most. That won't be happening anymore.

Incredibly rich to complain about LLM scraping with an LLM-generated article.