by pgao 2165 days ago
Yes, your interpretation is correct. I don't think we're going to get into synthetic data generation in the near term, mainly due to the amount of effort required + questions about domain transfer. However, we do improve dataset quality additively by picking the most useful data to label + retrain on, so you get the best performance.

Said another way: once you've found "I do badly on green cones," we use similarity search on the embeddings of known green cone examples to find more instances of green cones in the wild. We pick the right examples from streams of unlabeled data, send them to labeling, and add them to your dataset so the model does better the next time you retrain.
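
To make the similarity-search step concrete, here is a minimal sketch, not our actual pipeline. It assumes you already have embedding vectors for a few known failure cases and for a pool of unlabeled data. The function name `find_similar`, the 16-dimensional toy embeddings, and the planted indices are all made up for illustration.

```python
import numpy as np

def find_similar(query_embeddings, pool_embeddings, k=5):
    """Return indices (and scores) of the k pool items closest to any query, by cosine similarity."""
    q = query_embeddings / np.linalg.norm(query_embeddings, axis=1, keepdims=True)
    p = pool_embeddings / np.linalg.norm(pool_embeddings, axis=1, keepdims=True)
    # For each pool item, keep its similarity to the nearest known example.
    scores = (p @ q.T).max(axis=1)
    top = np.argsort(-scores)[:k]
    return top, scores[top]

# Toy data (hypothetical): a "green cone" direction in a 16-d embedding space.
rng = np.random.default_rng(0)
dim = 16
green_cone = np.zeros(dim)
green_cone[0] = 1.0

# Three known green-cone examples the model does badly on.
known = green_cone + 0.05 * rng.normal(size=(3, dim))

# 100 unlabeled items, with two unlabeled green cones planted at rows 7 and 42.
pool = rng.normal(size=(100, dim))
pool[[7, 42]] = green_cone + 0.05 * rng.normal(size=(2, dim))

idx, scores = find_similar(known, pool, k=2)
print("send to labeling:", sorted(idx.tolist()))
```

At real scale you would likely swap the brute-force matrix product for an approximate nearest-neighbor index, but the selection logic stays the same.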

1 comment

I like this much better than synthetic data augmentation, actually. I think synthetic augmentation, like with GANs, is a failed concept.

There have long been theoretical limits on how much you can gain by ensembling with a model of known limitations, and at root that is all synthetic training data is.

You can’t “make up” training data that lets you escape the performance ceiling implied by whatever generator process you use for the synthetic data, any more than you can learn a better regression just by bootstrapping a large sample from your existing training set.
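
The bootstrap half of that analogy is easy to check on a toy problem. The sketch below is only an illustration under made-up assumptions (a linear model with Gaussian noise, 30 training points, hypothetical helper names). It fits ordinary least squares three ways: on the original sample, on a 30,000-row bootstrap resample of that same sample, and on 30,000 genuinely new points. It says nothing about GANs directly.

```python
import numpy as np

rng = np.random.default_rng(0)
true_coef = np.array([1.0, 2.0, -3.0])  # intercept and two slopes (made up)

def make_data(n):
    """Draw n points from the toy linear model with unit-variance noise."""
    X = rng.normal(size=(n, 2))
    y = true_coef[0] + X @ true_coef[1:] + rng.normal(size=n)
    return X, y

def fit_ols(X, y):
    """Ordinary least squares with an intercept column."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def mse(coef, X, y):
    A = np.column_stack([np.ones(len(X)), X])
    return float(np.mean((A @ coef - y) ** 2))

X_train, y_train = make_data(30)
X_test, y_test = make_data(10_000)

# 1) Fit on the original 30 points.
coef_orig = fit_ols(X_train, y_train)

# 2) "Make up" 30,000 rows by resampling those 30 points with replacement.
boot = rng.integers(0, len(X_train), size=30_000)
coef_boot = fit_ols(X_train[boot], y_train[boot])

# 3) Fit on 30,000 genuinely new points from the same process.
X_more, y_more = make_data(30_000)
coef_more = fit_ols(X_more, y_more)

for name, coef in [("original", coef_orig), ("bootstrap", coef_boot), ("new data", coef_more)]:
    err = np.linalg.norm(coef - true_coef)
    print(f"{name:10s} coef error {err:.4f}  test MSE {mse(coef, X_test, y_test):.4f}")
```

The bootstrap fit lands almost exactly on the original fit, because the resample contains no information the 30 points didn't already have. Only the genuinely new data moves the coefficients closer to the truth.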

Algorithmic synthetic data is mostly fool’s gold.