by rg111 987 days ago
This is bad for the small guys, though.

Do you think that DeepMind, OpenAI, etc. don't have this dataset copied 10 times over? Think again!


How is this different from running pirated software for example? Big business doesn't do it. And if they do they get fined heavily. Same should be true for illegal AI training data. Surely this is a problem that can be solved.
> How is this different from running pirated software for example? Big business doesn't do it.

All the current LLMs are trained on data scraped from random websites, with no regard for the website's copyright, aren't they?

Presumably because these big businesses have a theory that training an LLM is 'fair use'.

Why wouldn't they treat books the same way they treat web pages?

I'm tired of tech companies abusing the public then asking for forgiveness.
I actually think this kind of use is purer to the original intent of the web, where everything on the internet was freely available and consumption was encouraged. "Information wants to be free" used to be the rallying cry.

That being said, it feels like there's also a shade of perspective from the old quote:

"In its majestic equality, the law forbids rich and poor alike to sleep under bridges, beg in the streets and steal loaves of bread." - assuming everything in public is fair game, then everyone is welcome to build a multi-petabyte database of text and use millions of dollars worth of GPUs to train an AI on it.

That last point is great. I've definitely seen a lot of people talking about how we need to let the little guy develop their own AI, with very little attention paid to the actual realistic costs of doing so. GPT-4, I believe, cost around $100 million to train.
Bold of you to think they're asking for forgiveness at all. I've seen mostly apologists.
I agree and feel the same way.

But it also feels like standing at the edge of the sea complaining about the tide coming in... I'm not sure there's really much that can or will be done about it.

We've tried nothing and are all out of ideas!
Except big business does it all the time. Your parent comment mentioned OpenAI, who does it.
It's a valid question: how can smaller companies be competitive in this business?

But saying "we steal to stay relevant, because we don't have as much funding as our competitors" is not the answer.

Your argument seems to be: why stop the little guys from doing something illegal if the big guys are doing it too? We should ideally stop them all, but it isn't surprising that the blatant examples of illegality are stopped first.
If datasets are not shared externally, which is how DeepMind etc. do it, it is much more of a grey area. Training may well end up being fair use. So yes, the efforts of Shawn compiling books3 just level the playing field a bit.