Hacker News
by michaelt 987 days ago
> How is this different from running pirated software for example? Big business doesn't do it.

All the current LLMs are trained on data scraped from random websites, with no regard for the website's copyright, aren't they?

Presumably because these big businesses have a theory that training an LLM is 'fair use'.

Why wouldn't they treat books the same way they treat web pages?


I'm tired of tech companies abusing the public then asking for forgiveness.
I actually think this kind of use is purer to the original intent of the web, where everything on the internet was freely available and consumption was encouraged. "Information wants to be free" used to be the rallying cry.

That being said, it feels like there's also a shade of perspective from the old quote:

"In its majestic equality, the law forbids rich and poor alike to sleep under bridges, beg in the streets and steal loaves of bread." - assuming everything in public is fair game, then everyone is welcome to build a multi-petabyte database of text and use millions of dollars' worth of GPUs to train an AI on it.

That last point is great. I've definitely seen a lot of people talking about how we need to let the little guy develop their own AI, with very little attention paid to the actual realistic costs of doing so. GPT-4, I believe, cost around $100 million to train.
Bold of you to think they're asking for forgiveness at all. I've seen mostly apologists.
I agree and feel the same way.

But it also feels like standing at the edge of the sea complaining about the tide coming in... I'm not sure there's really much that can or will be done about it.

We've tried nothing and are all out of ideas!