|
|
|
|
|
by sillysaurusx
1003 days ago
|
|
It’s risky to make definitive claims about what is or isn’t a possible security vector, but based on my years of training GPTs, you’d find it very difficult for a number of reasons. Firstly, the malicious data needs to form a significant portion of the data. Given that training data is on the order of terabytes, this alone makes it unlikely you’ll be able to poison the dataset. Unless the entire training dataset was also stored in this 38TB, you’ll only be able to fine tune the model, and fine tuning tends to destroy model quality (or else fine tuning would be the default case for foundation models — you’d train it, fine tune it to make it “even better” somehow, then release it. But we don’t, because it makes the model less general by definition). |
|
What fraction of the training data needed to be that text?