| HN Mirror

	by web007 1520 days ago
	https://the-eye.eu/public/AI/ has a couple of the models shown, as well as "The Pile", the input training data used for GPT-J-6B. They're... a bit weird. The sources are biased toward "internet content" IMO (Literotica, HN, GitHub, Stack Exchange), versus traditional sources like newspaper articles or other professional writing. Some of the other sources might balance that out, since they're as dry, complex, and squeaky clean as you can get (EU proceedings, case law), but I still expect the result to be lowest-common-denominator, tweetable content that's missing the detail and style of more professional pieces. I know the OpenAI GPT series was trained partially on Reddit content, so that baseline isn't much better.