by cshores
547 days ago
It ultimately doesn't matter because a fairly current snapshot of all of the world's information is already housed in their data lakes.
The next stage for AI training is to generate synthetic data, either from other AI models or from simulations, and to train further on that, since human-generated content can only go so far.
|
If there is untapped signal in existing datasets, then learning processes should be improved. It does not follow that there should be a separate economic step where someone produces "synthetic data" from the real data, and then we treat the fake data as real data. From a scientific perspective, that last part sounds really bad.
For the purpose of machine learning, creating derivative data from real data sounds like a scam by the data broker industry. What is the theory behind it, if not fleecing unsophisticated "AI" companies? Is it just myopia, or Goodhart's Law applied to LLM scaling curves? Did some MBA take the "data is the new oil" comment a little too seriously and infer that data is as fungible as refined petroleum?