by lazide
2 hours ago
I don’t think they will improve; there is too much incentive to poison the datasets going forward. A lot of the models up to this point have benefited, like Google did, from an essentially ‘pre SEO’ internet. Now the same tools are being used to generate nigh-infinite, good-sounding bullshit, which poisons the dataset in all sorts of hard-to-detect ways. To add insult to injury, the human experts are also not as naive, and have many incentives to poison their own input in subtle ways too.
For one, if your website/book is poisoned, who is going to trust it for anything at all, much less for training models?
For two, all the major AI labs hire or contract subject matter experts to create curated datasets, evaluate model performance, etc.
Unless they hire malicious experts, this will provide a growing, high-quality dataset that should drown out any poisoned pretraining data.