|
How is synthetic data supposed to work? Broadly speaking, ML is about extracting signal from noisy data and learning the subtle patterns. If there is untapped signal in existing datasets, then learning processes should be improved. It does not follow that there should be a separate economic step where someone produces "synthetic data" from the real data, and then we treat the fake data as real data. From a scientific perspective, that last part sounds really bad. Creating derivative data from real data sounds, for the purpose of machine learning, like a scam by the data broker industry. What is the theory behind it, if not fleecing unsophisticated "AI" companies? Is it just myopia, Goodhart's Law applied to LLM scaling curves? Some MBA took the "data is the new oil" comment a little too seriously and inferred that data is as fungible as refined petroleum? |