The most detailed answer to that I've seen is the original LLaMA paper, which described exactly what that model was trained on (including lots of scraped copyrighted data) https://arxiv.org/abs/2302.13971
Llama 2 was much more opaque about the training data, presumably because they were already being sued at that point (by Sarah Silverman!) over the training data that went into the first Llama!
Wow, that paper was super useful. Thanks for sharing. Page 2 is where it shows the breakdown of all of the data sources, including % of dataset and the total disk sizes.