Fewer tokens than Llama 3 (3.3T vs 15T), yet a better outcome. No doubt the training data is more information-dense. The interesting thing is the use of synthetic data, which they don't talk about.
Actually, the original Phi papers did talk about their synthetic data strategy, and it's very cool -- essentially, invert high-quality textbook text using GPT-4 to create prompts, where the textbooks supply the answers. There may be more that is undisclosed, but it remains in my mind one of the best ideas of the last twelve months -- so smart, and interesting, and apparently it works well.
No, they don't use textbook text at all, despite the paper title. They just asked GPT-4 to generate "textbook quality" content, which doesn't even look exactly like a textbook.
I feel like literal dictionaries would make good training data; I wonder if any of them have done that. LLMs are good at faking, so it's hard to tell by asking them.