by simonw
490 days ago
"The biggest models want to train on literally every piece of human-written text ever written" They genuinely don't. There is a LOT of garbage text out there that they don't want. They want to train on every high quality piece of human-written text they can get their hands on (where the definition of "high quality" is a major piece of the secret sauce that makes some LLMs better than others), but that doesn't mean every piece of human-written text.
OpenAI is Uber with a slightly less ethically despicable CEO.
It knows it's flouting the spirit of copyright law -- it's just hoping it can bootstrap quickly enough to make the question irrelevant.
If every commercial AI company that couldn't prove training data provenance were bankrupted tomorrow, I wouldn't shed an ethical tear. Live by the sword, die by the sword.