

by pphysch 585 days ago

How is synthetic data supposed to work? Broadly speaking, ML is about extracting signal from noisy data and learning the subtle patterns.

If there is untapped signal in existing datasets, then learning processes should be improved. It does not follow that there should be a separate economic step where someone produces "synthetic data" from the real data, and then we treat the fake data as real data. From a scientific perspective, that last part sounds really bad.

Creating derivative data from real data sounds, for the purpose of machine learning, like a scam by the data broker industry. What is the theory behind it, if not fleecing unsophisticated "AI" companies? Is it just myopia, Goodhart's Law applied to LLM scaling curves? Some MBA took the "data is the new oil" comment a little too seriously and inferred that data is as fungible as refined petroleum?

5 comments

joshribakoff 585 days ago

I tried to train an AI to guess the weight and reps from my exercise log, but it would produce nonsense results for rep ranges I didn’t have enough training data for, as if it didn’t understand that more weight means fewer reps. I used synthetic training data, interpolating and imputing data for the rep ranges I didn’t have data for using estimation formulas. The network then predicted better, but it also made me realize I had basically made the model learn the prediction formula, that AI was not actually needed, and that I’m better off using the prediction formula directly.

But it also illustrates that the model can learn from a calculation or estimation the same way it learns from the real world, without necessarily needing to train exclusively in the real world. An AI car driving in a simulation may actually learn some of the formulas that apply both in the simulation and in the real world. The same simulations and synthetic data can also be just as useful for validation, not just training. It’s not hard to imagine scenarios that are impractical, illegal or unethical to test in real life.

Also, as AI becomes more advanced, synthetic data can be useful for generating superhuman examples. It’s not hard to imagine you could improve upon data from a human driver by synthetically altering it to be even safer.
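A minimal sketch of the kind of imputation described above, for illustration only. The comment does not say which estimation formula was used; the Epley one-rep-max formula (1RM = weight × (1 + reps / 30)) is assumed here, and the log entries and function names are hypothetical.

```python
# Hypothetical log entries: (weight_kg, reps) pairs that were actually performed.
logged_sets = [(100.0, 5), (90.0, 8)]

def epley_one_rep_max(weight, reps):
    """Estimate a one-rep max with the Epley formula (assumed, not from the comment)."""
    return weight * (1 + reps / 30)

def epley_weight_for_reps(one_rep_max, reps):
    """Invert the Epley formula to get the weight expected at a given rep count."""
    return one_rep_max / (1 + reps / 30)

def synthesize_missing_sets(logged, rep_range):
    """Return synthetic (weight, reps) rows for rep counts absent from the log."""
    # Average the per-set estimates into a single strength estimate.
    estimates = [epley_one_rep_max(w, r) for w, r in logged]
    one_rep_max = sum(estimates) / len(estimates)
    seen = {r for _, r in logged}
    return [
        (round(epley_weight_for_reps(one_rep_max, r), 1), r)
        for r in rep_range
        if r not in seen
    ]

# Fill in rep counts 1..12 that the log does not cover.
synthetic_sets = synthesize_missing_sets(logged_sets, range(1, 13))
```

Every synthetic row comes straight from the formula, so a model trained on these rows can at best recover the formula for those rep ranges, which matches the conclusion in the comment.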

pphysch 584 days ago

Thanks, I now can see synthetic data being used to patch up holes and deal with ethical issues.

I still don't see how it could address the volume problem, like needing 10x or 100x of current data to train GPT5.

cshores 585 days ago

As others have mentioned, Tesla is already implementing similar advancements. More broadly, a new AI framework called Genesis has emerged, capable of training robots in just minutes using purely synthetic data. It generates a virtual environment for the robot to "perceive" and train within, even though this environment doesn't physically exist. This is just one example. Another could involve an AI specifically trained to diagnose illnesses based on genetic information in DNA. The insights gained from this virtual scientist could then cross-pollinate with other AIs, enhancing their training and capabilities as well.

Nevermark 585 days ago

Competition between AIs to solve problems better or faster than each other, while still learning from each other, is another way to start with simple problems and naturally bootstrap increasing difficulty.

elfly 585 days ago

Synthetic data works as long as it is directed towards a clear objective and curated.

At one point someone generated a Python teaching book from an LLM, took that, trained a second LLM with it, and the new LLM knew Python.

If you are just dragging random content from the web and you don't know what's synthetic and what's human, that data may be contaminated and a lot less useful, but if someone wanted to whitewash their training data by replacing a part of it with synthetic data, it can be done.

RationPhantoms 585 days ago

Would you trust a ML self-driving algorithm trained on a "digital twin" of a city? I would. I view synthetic training data like a digital twin, in that it can provide further control or specified noise to learn from.

scottLobster 585 days ago

No, because right now I'm working closely with some EEs to troubleshoot electrical issues on some prototype boards (I wrote the firmware). They're prototypes precisely because we know the limits of our models and simulations and need real world boards to test our electronics design and firmware on.

You're suggesting the new, untested models in a new, untested technological field are sufficient for deployment in real world applications even with a lack of real world data to supplement them. That's magical thinking given what we've experienced in every other field of engineering (and finance for that matter).

Why is AI/ML any different? Because highly anthropomorphized words like "learning" and "intelligence" are in the name? These models are some of the most complex machines humanity has ever produced. Replace "learning" and "intelligence" with "calibrated probability calculators". Then detail the sheer complexity of the calibrations needed, and tell me with a straight face that simulations are good enough.

Nevermark 585 days ago

Both are likely to be much better.

Simulations may not be good enough alone, but still provide a significant boost.

Simulations can cheaply include scenarios that would be costly or dangerous to actually perform in the real world. And cover many combinations of scenario factors to improve combinatorial coverage.

Another way is to separate models into highly real world dependent (sensory interpretation) and more independent (kinematics based on sensory interpretation) parts, the latter being more suited to training in simulation. Obviously full real world testing is still necessary to validate the results.
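As a rough illustration of the point about covering many combinations of scenario factors, here is a small sketch that enumerates a full grid of simulated scenarios. The factor names and values are hypothetical and not taken from any real simulator.

```python
from itertools import product

# Hypothetical scenario factors for a driving simulator (illustrative only).
factors = {
    "weather": ["clear", "rain", "fog"],
    "time_of_day": ["day", "night"],
    "pedestrian_crossing": [False, True],
}

def scenario_grid(factors):
    """Enumerate every combination of factor values as a dict."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]

# 3 weather values x 2 times of day x 2 pedestrian settings = 12 scenarios.
scenarios = scenario_grid(factors)
```

The grid grows multiplicatively with each added factor, which is cheap to run in simulation but quickly becomes impractical to stage in the real world.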

fennecbutt 578 days ago

Hey, let's shut down humanity because human behaviour can't be perfectly simulated.

kjkjadksj 585 days ago

What makes you assume your digital twin is actually capturing the factors that contribute to variation in the real data? This is a big issue in simulation design, but ML researchers seemingly hand-wave it away.

fragmede 585 days ago

Probably due to reports like these where the digital twin is credited with gains in factory efficiency.

https://www.forbes.com/sites/carolynschwaar/2024/12/09/schae...

joshribakoff 585 days ago

It either improves the results or it does not. I don’t think I see the problem.

Corrado 585 days ago

Isn’t this what Tesla does for their driving data? However, it would fall apart if they didn’t have real world data to feed into it, right?

heavyset_go 585 days ago

> Would you trust a ML self-driving algorithm trained on a "digital twin" of a city? I would.

No, just as I wouldn't trust a surgeon who studied medicine by playing Operation. A gross approximation is not a substitute for real life.

fragmede 585 days ago

Hope you don't need surgery then! Suture training kits like these are quite popular for surgeons to train on. https://a.co/d/3cAotZ0 I don't know about you, but I'm not a rubbery rectangular slab of plastic, so obviously this kit can't help them learn.

heavyset_go 585 days ago

This is a reason I opted to have a plastic surgeon come in when I went to the ER with an injury.

I could've had the nurse close me up and leave me with a scar, which she admitted would happen with her practice, or I could have someone with extensive experience treating wounds so that they'd heal in a cosmetically appealing way do it. I opted for the latter.

scottLobster 585 days ago

The difference being that you have to do a little more than that to become a board-certified surgeon. If a VC gives you a billion dollars to buy and practice on every available surgery practice kit in the world, you will still fail to become a surgeon. And we enforce such standards because if we don't then people die needlessly.

Nevermark 585 days ago

How a model learns doesn’t really matter. What works works.

How it is tested and validated is what matters.

There are lots of ways to train on synthetic data, and synthetic data can have advantages as well as disadvantages over natural data.

Creative use of synthetic data is going to lead to many cases where we find it is good enough. Or even better than natural data.

joshribakoff 585 days ago

What about a doctor who used a mix of training both on live patients as well as cadavers and models?

heavyset_go 585 days ago

Is this doctor able to learn new information and work through novel problems on the fly, or will their actions always be based on the studying they did in the past on old information?

Similarly, when this doctor sees something new, will they just write it off as something they've seen before and confidently work from that assumption?

phyalow 585 days ago

Um, augmentation (i.e. the generation of synthetic data) is a very very well known technique for improving learning.

Also, what’s with the hate for MBAs?

Your comment is off kilter with the rules here.

pphysch 584 days ago

Synthetic data is being proposed here as a solution to extrapolate ML scaling.

Augmentation, interpolation, smoothing are different concepts.

phyalow 580 days ago

I think you're drawing an artificial distinction here. Synthetic data generation is fundamentally an extension of augmentation. When OpenAI uses expert generated examples and curriculum based approaches, that's literally textbook augmentation methodology. The goal of augmentation has always been to improve model fit, and scaling is just one aspect of that.

Your concern about extrapolation is interesting but misses something key: when we generate synthetic data through expert demonstration or guided curriculum, we're not trying to magically create capabilities beyond the training distribution. Instead, we're trying to better sample the actual distribution of problem-solving approaches humans use. This isn't extrapolation; rather, it's better sampling of an existing, complex distribution!

I.e., if you think about the manifold hypothesis, then we know real data lives on a lower-dimensional manifold, and good synthetic data helps fill the gaps in how that manifold is sampled. This naturally leads to better extrapolation; it's pretty well established at this point.

TBH I think you are characterizing this as some kind of blind data multiplication scheme, but it's much closer to curriculum learning: you start with basic synthetic examples and gradually ramp up complexity. So the question isn't whether synthetic data is "real" or not, but whether it effectively helps map the underlying distribution and reasoning patterns.

Funny enough, your oil analogy actually supports the case for synthetic data: refined petroleum is more useful than crude for specific purposes, just as well-designed synthetic data can be more effective than raw internet text for certain learning objectives.