Hacker News
by gumballindie 894 days ago
> But it was on purpose not trained on the big “web crawled” datasets to not learn how to build bombs etc, or be naughty.

It wasn't trained on web-crawled data, to make it less obvious that Microsoft steals property and personal data to monetise it.

It was trained on "textbook quality" synthetic data plus some high-quality web data.

The question is: if we train a model on synthetic data generated by GPT-4 which has copyright issues, what is the status of this model? Will MS have to delete it as well? And all models trained on GPT-4 data?

> if we train a model on synthetic data generated by GPT-4 which has copyright issues

Is that the new directive from HQ? I see a lot of folks parroting this logic, ignoring that proceeds of crime are criminal themselves.