How is this different from running pirated software, for example? Big business doesn't do it. And if they do, they get fined heavily. The same should be true for illegal AI training data. Surely this is a problem that can be solved.
I actually think this kind of use is truer to the original intent of the web, where everything on the internet was freely available and consumption was encouraged. "Information wants to be free" used to be the rallying cry.
That being said, it feels like there's also a shade of perspective from the old quote:
"In its majestic equality, the law forbids rich and poor alike to sleep under bridges, beg in the streets and steal loaves of bread." Assuming everything in public is fair game, then everyone is welcome to build a multi-petabyte database of text and use millions of dollars' worth of GPUs to train an AI on it.
That last point is great. I've definitely seen a lot of people talking about how we need to let the little guy develop their own AI, with very little attention paid to the realistic costs of doing so. GPT-4, I believe, cost around $100 million to train.
But it also feels like standing at the edge of the sea complaining about the tide coming in... I'm not sure there's really much that can or will be done about it.
Your argument seems to be: why stop the little guys from doing something illegal if the big guys are doing it too? We should ideally stop them all, but it isn't surprising that the blatant examples of illegality are stopped first.
If datasets are not shared externally, which is how DeepMind and others do it, it is much more of a grey area. Training may well end up being fair use. So yes, Shawn's efforts compiling books3 just level the playing field a bit.