| HN Mirror


by lobstersammich 1133 days ago

You can find a high-level list of GPT-2's training datasets in the GPT-2 repository on GitHub: https://github.com/openai/gpt-2/blob/master/model_card.md#da... However, OpenAI goes dark after that regarding the 'data soup' that was fed into their LLMs. In general, starting around 2019 and definitely by 2020, you'll notice that research labs became much less forthcoming about the data that went into their models. As far as I'm aware, BookCorpus is one of the more commonly used large book datasets for training large language models (LLMs) like generative pretrained transformers in recent years: https://12ft.io/proxy?q=https%3A%2F%2Ftowardsdatascience.com...

At my alma mater, the University of Michigan, I remember the large-scale Google book-scanning devices and what a herculean effort it was to digitize the books of the largest university library system, although only 7M texts out of the entire collection of ~16 million (https://en.wikipedia.org/wiki/University_of_Michigan_Library) were digitized. I too was curious about the state of the Google Books project: https://www.edsurge.com/news/2017-08-10-what-happened-to-goo...

This is an interesting piece of ephemera from 2005, when Google started digitizing books at UMich: https://apps.lib.umich.edu/files/services/mdp/faq.pdf

As far as I recall, the Books project allowed the early n-grams functionality to be built out: https://ai.googleblog.com/2006/08/all-our-n-gram-are-belong-...

The Google Books Ngram Viewer is still around; you can play around with it here: https://books.google.com/ngrams/graph?corpus=0&content=Vorsp...