| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by vessenes 793 days ago
	Actually the original Phi papers did talk about their synthetic data strategy, and it's very cool -- essentially invert high quality textbook text using GPT-4 to create prompts, where the textbooks supply the answers. There may be more undisclosed, but it remains in my mind as one of the best ideas of the last twelve months -- so smart, and interesting, and apparently, it works well.

4 comments

YetAnotherNick 793 days ago

No they don't use textbook text at all despite the paper title. They just asked GPT-4 to generate "textbook quality" content, which doesn't even exactly looks like textbook.

link

astrange 793 days ago

I feel like literal dictionaries would make good training data; wonder if any of them have done that. LLMs are good at faking so it's hard to tell by asking them.

link

torginus 793 days ago

Except everything that comes out of an LLM (like GPT4) is highly suspect (at least in my experience).

link

samus 793 days ago

1. They need it for style and language, not necessarily for the facts

2. Since GPT-4 is seen as the very best general-purpose LLM in existence, it makes sense to emulate its performance with less resources.

3. Phi models are also trained with other high-quality data

link

xarope 793 days ago

perhaps that's the best path forward? Text and reference books (hopefully unbiased) for answers, and web scraped data for conversational tone.

link