by alchemist1e9 1301 days ago

It’s odd how little discussion there is of the inputs, because the more reputable the inputs, the more the model can be trusted. I’d really like to know what body of knowledge it has been trained on.

My guess is that this is obscured for legal reasons: they have used a massive body of copyrighted data and hope to avoid controversy by not talking about the inputs.

I once saw a huge collection of links to curated input datasets for language models, but unfortunately I haven’t been able to find it in my notes/bookmarks yet.