| HN Mirror



	by hackerlight 794 days ago
	Fewer tokens than Llama 3 (3.3T vs 15T), yet a better outcome. No doubt the training data is more information-dense. The interesting thing is the use of synthetic data, which they don't talk about.

2 comments

vessenes 794 days ago

Actually, the original Phi papers did talk about their synthetic data strategy, and it's very cool -- essentially, invert high-quality textbook text using GPT-4 to create prompts, where the textbooks supply the answers. There may be more that is undisclosed, but it remains in my mind as one of the best ideas of the last twelve months -- so smart, and interesting, and apparently it works well.


YetAnotherNick 794 days ago

No, they don't use textbook text at all, despite the paper title. They just asked GPT-4 to generate "textbook quality" content, which doesn't even exactly look like a textbook.


astrange 794 days ago

I feel like literal dictionaries would make good training data; I wonder if any of them have done that. LLMs are good at faking, so it's hard to tell by asking them.


torginus 794 days ago

Except everything that comes out of an LLM (like GPT-4) is highly suspect (at least in my experience).


samus 793 days ago

1. They need it for style and language, not necessarily for the facts.

2. Since GPT-4 is seen as the very best general-purpose LLM in existence, it makes sense to emulate its performance with fewer resources.

3. Phi models are also trained with other high-quality data.


xarope 794 days ago

Perhaps that's the best path forward? Textbooks and reference books (hopefully unbiased) for answers, and web-scraped data for conversational tone.


minimaxir 794 days ago

Yes, "chinchilla optimal" is a meme, but 15T might turn out to be too many tokens.
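For a rough sense of scale, here is a back-of-the-envelope sketch of how far past "chinchilla optimal" these token counts are. The ~20 tokens per parameter figure is the commonly quoted rule of thumb, not an exact law, and the parameter counts (8B for Llama 3, 3.8B for a Phi-3-mini-sized model) are assumptions that are not stated in this thread; only the 15T and 3.3T token counts come from the parent comment.

```python
# Rule-of-thumb ratio often attributed to the Chinchilla paper (assumption, approximate).
CHINCHILLA_TOKENS_PER_PARAM = 20


def tokens_per_param(tokens: float, params: float) -> float:
    """Training tokens seen per model parameter."""
    return tokens / params


def overtraining_factor(tokens: float, params: float,
                        optimal_ratio: float = CHINCHILLA_TOKENS_PER_PARAM) -> float:
    """How many times more tokens were used than the rule of thumb suggests."""
    return tokens_per_param(tokens, params) / optimal_ratio


# Token counts are from the thread; parameter counts are assumed for illustration.
runs = {
    "Llama 3, assuming 8B params": (15e12, 8e9),
    "Phi-3, assuming 3.8B params": (3.3e12, 3.8e9),
}

for name, (tokens, params) in runs.items():
    ratio = tokens_per_param(tokens, params)
    factor = overtraining_factor(tokens, params)
    print(f"{name}: {ratio:,.0f} tokens/param, {factor:.1f}x the rule of thumb")
```

Under those assumptions both runs sit far beyond the rule of thumb, so the open question is less "is it chinchilla optimal" and more where the returns from extra tokens flatten out.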


wrsh07 794 days ago

My understanding from this tweet thread [1] is that Chinchilla probably overspecified some of the hyperparameters to the model.

tl;dr I'm looking forward to having lots of models (ideally models) trained with a wide range of parameters to narrow down "what is actually optimal"

I think there is an interesting tradeoff of data quality and data volume, though

(E.g., if we train with the highest-quality 10% of our data, does the model improve if we also use the other 90%? What if we increase our data size by 10x?)

[1] https://twitter.com/tamaybes/status/1780639257389904013
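To make the quality-vs-volume experiment above concrete, here is a minimal sketch of how the training subsets could be carved out. It assumes each document already has a numeric quality score from some classifier or heuristic; `top_fraction` and the toy scores are hypothetical names and values for illustration, not anything from the Phi or Llama pipelines.

```python
import math


def top_fraction(docs, scores, fraction):
    """Return the highest-scoring `fraction` of docs, in their original order."""
    if len(docs) != len(scores):
        raise ValueError("docs and scores must have the same length")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not docs:
        return []
    keep = max(1, math.ceil(len(docs) * fraction))
    # Rank document indices by score, best first.
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:keep])
    return [docs[i] for i in chosen]


# Toy corpus with made-up quality scores.
docs = [f"doc{i}" for i in range(10)]
scores = [0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 0.5, 0.0]

# Subsets to train on and compare: best 10%, best 50%, everything.
subsets = {frac: top_fraction(docs, scores, frac) for frac in (0.1, 0.5, 1.0)}
for frac, subset in subsets.items():
    print(f"top {frac:.0%}: {len(subset)} docs")
```

Training the same model on each subset and comparing evals would answer the first question; the 10x-more-data question needs a bigger corpus rather than a different filter.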
