| HN Mirror

by simonw 818 days ago

The most detailed answer to that I've seen is the original LLaMA paper, which described exactly what that model was trained on (including lots of scraped copyrighted data) https://arxiv.org/abs/2302.13971

Llama 2 was much more opaque about the training data, presumably because they were already being sued at that point (by Sarah Silverman!) over the training data that went into the first Llama!

A couple of things I've written about this:

- https://simonwillison.net/2023/Aug/27/wordcamp-llms/#how-the...

- https://simonwillison.net/2023/Apr/17/redpajama-data/

2 comments

ssgodderidge 818 days ago

Wow, that paper was super useful. Thanks for sharing. Page 2 is where it shows the breakdown of all of the data sources, including % of dataset and the total disk sizes.

shnkr 818 days ago

My question was specific to the Databricks model. If it followed Llama or OpenAI, they could add a line or two about it to make the blog post complete.

comp_raccoon 818 days ago

They have a technical report coming! Knowing the team, they will do a great job disclosing as much as possible.
