by simonw 818 days ago
The most detailed answer to that I've seen is the original LLaMA paper, which described exactly what that model was trained on (including lots of scraped copyrighted data) https://arxiv.org/abs/2302.13971

Llama 2 was much more opaque about the training data, presumably because they were already being sued at that point (by Sarah Silverman!) over the training data that went into the first Llama!

A couple of things I've written about this:

- https://simonwillison.net/2023/Aug/27/wordcamp-llms/#how-the...

- https://simonwillison.net/2023/Apr/17/redpajama-data/

2 comments

Wow, that paper was super useful. Thanks for sharing. Page 2 is where it shows the breakdown of all of the data sources, including % of dataset and the total disk sizes.
My question was specific to the Databricks model. If it followed Llama or OpenAI, they could add a line or two about it to make the blog post complete.
They have a technical report coming! Knowing the team, they will do a great job disclosing as much as possible.