by sevenzero 29 days ago
>which have indexed all of the books and used pirated copies to do so

Funnily enough, people on HN often do not consider this an issue, like at all... I wonder how they'd think about it if they had created something (meaningful) that was subjected to this. I love Go and learned a lot of it over the past 2 years but ultimately put it down in favor of more "batteries included" solutions, as I don't trust myself enough as a dev to confidently handle concurrency in Go. Still, it's a beautiful language and if I ever come back I hope I can still find books about it, as I hate using AI for learning.


I have a different impression: that the folks here are divided on this issue, with one half being AI maximalists saying it's a necessary evil while the other half condemns such practices, maybe not so much to protect copyright per se, but because there are two different standards here. While teenagers get ridiculous fines for sharing MP3s, big corps get a free pass for stealing data on an industrial scale.
If AI was public domain and free for everyone, I would have fewer issues with it (not saying no issues). But yeah, the only people actually benefiting from this are big tech corps who have been actively destroying society for over a decade now.

The argument about the ability to self host doesn't really make sense to me given that most of society can not even afford RAM at the moment. So all these big tech frontier models should be public domain.

> because of economic pressures

Self-hosting isn't relevant here anyway. When discussing the hoovering up of information irrespective of licences to produce the model, where the model is finally run isn't significant.

You might not be paying the industry pirates-at-scale to run a model on their hardware, but you are still using the same information, irrespective of the same desires of its creators, the same way, just in a different location.

Heck, local hosting might even be making the situation worse if people are trying to train their own model because they are then likely to be scraping data too, and becoming part of the army of bots that are pushing hosting costs up and forcing everyone to use tricks like PoW scripts that can inconvenience human readers as much as the scrapers.

> You might not be paying the industry pirates-at-scale to run a model on their hardware, but you are still using the same information, irrespective of the same desires of its creators, the same way, just in a different location.

For individual use I personally think it's ok. Access to information shouldn't be penalized or regulated, but distribution should. So in this case it's relevant where a bootleg model is run.

And another half being copyright abolitionists like me who don't care about AI at all but see copyright as essentially a societal fiction that even if it was useful in the past is now no longer, or rather, only useful to big corporations to throw their weight around like Disney who lobbied the government to implement their infamous Mickey Mouse laws with ridiculous copyright term limits.
I agree with you to an extent, but I think that when people profit from a work (e.g. by using it to train a proprietary AI that they charge people to use) they should share the profit with the author of the work.

So I think Anna's Archive is fine. OpenAI is not.

That's why I believe in open weight or even open source AI models. If you're gonna train you might as well democratize access to everyone, not the faux "democratization" that OpenAI and Anthropic talk about where only they control access.
I didn't want to go into this topic, but I'm right here with you, I'm an information access anarchist.
> I wonder how they'd think about it if they had created something (meaningful) that was subjected to this.

I used to write books in the past (all obsolete since, well, two decades+ now) and I'm totally fine with piracy: people who are pirating content are typically not those who are going to pay for it anyway.

As a side note, I really wish that state resources spent fighting bad actors in society were first used to catch and imprison rapists and the like, not to chase pirates sailing the digital high seas, but I digress...

Priorities.

>I used to write books in the past (all obsolete since, well, two decades+ now) and I'm totally fine with piracy

That's why I wrote meaningful. Depending on the topic, two-decade-old books are rarely still meaningful (even if they might've been at the time of publishing). I'm talking about non-fiction here, as there's a ton of old but still relevant fiction out there. Nonetheless, if you had published these last year or whatever, I think you'd think differently about it if your sales dropped by 50%+ due to AI.

> Funnily enough, people on HN often do not consider this an issue, like at all...

I didn't have a problem with it when it was Aaron Swartz, not sure why I should have a problem with it when others do it.

Aaron Swartz never did whatever it was he was going to do. He was caught and hounded to death before that.

But he was working with scientific papers, the outputs of public institutions, and his likely goal was releasing them to the public. What proprietary AI companies have done in training LLMs on every book in existence is nothing like that.

A lot of what they have done is the reverse. They have used a lot of such publicly funded information (and a lot of other freely available information) to train LLMs that are proprietary.
The strange thing is he picked a fight with a store of humanities papers rather than scientific ones.
JSTOR holds content from lots of journals including in the sciences. It's not only humanities papers.
1) those were scientific papers; the authors weren't getting paid either way (unlike book authors making a living from them)

2) more importantly, Swartz wasn't building a business empire on the pirated data, and charging access

I don't see how the two are even remotely similar

A few years ago (before LLMs were as good as they are today) I wanted an LLM to do a RAG like memory on all the books I own. My dream was that every book I purchased would go into my LLM making it better but also giving me a reference back to the text to look up and help me get better.

Honestly I didn't expect LLMs to progress so fast. Now it just seems like an unnecessary solution to a problem that no longer exists.

I'd rather not have copyright at all, as I said in another comment it's not useful anymore. Information should instead just be free for everyone.
> > which have indexed all of the books and used pirated copies to do so

> Funnily enough, people on HN often do not consider this an issue, like at all...

That is far from true - opinion is quite divided, perhaps even close to 50/50. It sometimes seems that the opinion is skewed massively towards the positive because there are a lot more “look what I did with GenAI” stories because “yeah, I'm not doing that because… here's what I did the old way” doesn't catch interest in the same manner.

This is one of the (several) reasons I'm doing my level best to avoid using the tools - I don't want to pay in to the companies that have run roughshod over everyone's work because they can¹. This is a rather risky position to take in a company where the up-aboves have all but said “get with AI or get left behind”, but quite frankly at the moment “redundancy” isn't a scary word for me².

--------

[1] Take from a few (i.e. download a couple of TV shows) and it is piracy making you liable for huge fines or even prison time, take from practically everyone (hoover up all their published writing irrespective of licence, gum up their servers with your badly written, or well written but deliberately badly behaved, scraper, etc…) and that is perfectly valid for training purposes.

[2] I appreciate that for many this is not the case, and because of economic pressures they might have to compromise on their feelings if they have the same opinions as I do on GenAI.

>That is far from true - opinion is quite divided

That might be true if you look further into it. I am a casual frontpage reader and the frontpage usually is plastered with AI stuff. Either new bullshit benchmarks, AI workflows, AI editor updates, AI company did something bad (again), or cool(?) projects people vibecoded. I also had arguments about AI used for art on here before and my personal experience usually is people defending their slop art.

> That might be true if you look further into it.

You don't have to look much further into it. If you aren't making that much effort it is hardly anyone else's fault that you've got an inaccurate impression of how things are.