by danielmarkbruce 620 days ago
Did you read this paper? No one is suggesting o1 was trained with 100% synthetic or 50% or anything of that nature. Generalizing from "training exclusively/majority on synthetic data is bad" to "synthetic data is bad" is dumb.

Researchers are using synthetic data to train LLMs, especially for fine tuning, and especially instruct fine tuning. You are not up to date with recent work on LLMs.

1 comment

> No one is suggesting o1 was trained with 100% synthetic or 50% or anything of that nature.

Neither was I.

> "synthetic data is bad"

I never said that… I said that it makes for poor training data, which it does.

> Researchers are using synthetic data to train LLMs, especially for fine tuning, and especially instruct fine tuning

Then those researchers are training with subpar datasets, as the bias in that data will be compounded.

It’s a trade-off, since there’s only so much fresh data in the form you want. If they could use entirely non-synthetic data, I’m sure they would.

And again, you’re choosing to focus on this one point rather than my main point that prompts provide no moat.

> You are not up to date with recent work on LLMs.

There you go again making assumptions…

I think I’m done with this conversation though.