HN Mirror

	by loudmax 841 days ago
	Training large language models takes an enormous amount of data: ideally, multiples of Wikipedia and public domain content. You also want high-quality data, so if you're going to pull in Reddit or something similar, you need some way to separate factually accurate comments from garbage and trolling. Using output from ChatGPT is one way to generate a large volume of high-quality data. But this is expressly forbidden by OpenAI's terms of service, so you can't advertise the fact that you're doing it. OpenAI would be on shaky ground if they sued, though, because so much of their own training was done on copyrighted material that they never got permission to use in the first place.