| Quick notes from first glance at paper https://research.facebook.com/publications/llama-open-and-ef...: * All variants were trained on 1T - 1.4T tokens; which is a good compared to their sizes based on the Chinchilla-metric. Code is 4.5% of the training data (similar to others). [Table 2] * They note the GPU hours as 82,432 (7B model) to 1,022,362 (65B model). [Table 15] GPU hour rates will vary, but let's give a range of $1 to $4. The 7B model would have cost ~$82-329k and the 65B something in the range of ~$1-4M. They also note their total time spent for all models: "we used 2048 A100-80GB for a period of approximately 5 months" [sec 6, pg 10] * 65B model's performance is broadly comparable to PALM-540B. Not a small feat, but also could indicate the benefits of good model-vs-token size ratios [Tables 3,4,5,6]. Their conjecture for underperforming on MMLU (multitask language understanding) compared to PALM-540B and Chinchilla-70B is smaller fraction of books and academic training data. * Math and code tasks: Math tasks they are substantially worse than Minerva (comparing their 65B to Minerva 62B; they hands down fail against Minerva 540B) [Table 7]. Code tasks they are broadly competitive with PALM-540B (HumanEval and MBPP evals) [Table 8] * Surprising that instruction fine tuning takes such a small part of the paper (sec 4, pg. 7) |
Just yes we train it for so long etc. but they never speak about tens or even hundres of runs before they finalize the model parameters and architecture -.-