| HN Mirror



	by fractionalhare 2173 days ago
	If I understand correctly, it sounds like your platform is primarily intended for improving awareness and understanding of the data a team has, so they know which features to focus on and emphasize. Do you think you'll get into synthetic data generation as well? In other words, improving dataset quality additively, not just curatively.

1 comments

pgao 2173 days ago

Yes, your interpretation is correct. I don't think we're going to get into synthetic data generation in the near term, mainly due to the amount of effort required + questions about domain transfer. However, we do improve dataset quality additively by sampling the best data to label + retrain on to get the best performance.

Said another way: once you've found "I do badly on green cones," we use similarity search on the embeddings of known green cone examples to find more instances of green cones in the wild. We pick the right examples from streams of unlabeled data, then send them to labeling + add them to your dataset so the model does better the next time you retrain.
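As a rough illustration of the similarity-search step described above, here is a minimal sketch using cosine similarity over embeddings. It is not the platform's actual implementation: the function name `find_similar`, the toy 3-dimensional embeddings, and the choice of "best match against any known example" as the score are all assumptions made for the example.

```python
import numpy as np

def find_similar(known, pool, k=3):
    """Return indices (and scores) of the k pool embeddings closest to any known example."""
    # L2-normalize rows so that a dot product equals cosine similarity.
    known_n = known / np.linalg.norm(known, axis=1, keepdims=True)
    pool_n = pool / np.linalg.norm(pool, axis=1, keepdims=True)
    # sims[i, j] = cosine similarity between pool item i and known example j.
    sims = pool_n @ known_n.T
    # Score each pool item by its best match among the known examples (an assumption).
    scores = sims.max(axis=1)
    order = np.argsort(-scores)[:k]
    return order, scores[order]

# Hypothetical embeddings of examples already known to be "green cones".
known_green_cones = np.array([[1.0, 0.0, 0.1],
                              [0.9, 0.1, 0.0]])

# Hypothetical embeddings from a stream of unlabeled data.
unlabeled_pool = np.array([[0.95, 0.05, 0.05],
                           [0.0, 1.0, 0.0],
                           [0.0, 0.0, 1.0],
                           [0.8, 0.2, 0.1]])

indices, scores = find_similar(known_green_cones, unlabeled_pool, k=2)
print(indices.tolist(), scores.round(3).tolist())
```

The selected indices would then be the candidates to send to labeling. At real scale a brute-force matrix product like this would likely be replaced by an approximate nearest-neighbor index.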

mlthoughts2018 2173 days ago

I like this much better than synthetic data augmentation, actually. I think synthetic augmentation, like with GANs, is a failed concept.

There have long been theoretical limits on how much you can gain by ensembling with a model of known limitations, and at root that is all synthetic training data is.

You can’t “make up” training data that lets you escape the performance ceiling implied by whatever generator process you use for the synthetic data, any more than you can learn a better regression just by bootstrapping a large sample of data from your existing training set.

Algorithmic synthetic data is largely fool’s gold.
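The bootstrapping comparison above can be checked numerically. The sketch below is a toy setup, with assumed values for the true slope, noise level, and sample sizes: it fits a line to 30 points, then fits again on 30,000 points resampled with replacement from those same 30. The resampled fit lands essentially on top of the original fit, so any error the original fit has relative to the true slope is carried over rather than reduced. It only illustrates the bootstrap half of the analogy and says nothing about GANs specifically.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy ground truth: y = 2x + unit Gaussian noise.
true_slope = 2.0
n = 30
x = rng.normal(size=n)
y = true_slope * x + rng.normal(size=n)

def fit_slope(x, y):
    """Least-squares slope of a straight-line fit with intercept."""
    slope, _intercept = np.polyfit(x, y, deg=1)
    return slope

# Fit on the small original sample.
slope_orig = fit_slope(x, y)

# "Make up" a much larger dataset by resampling the same 30 points with replacement.
idx = rng.integers(0, n, size=30_000)
slope_boot = fit_slope(x[idx], y[idx])

print("error of original fit vs truth:     ", abs(slope_orig - true_slope))
print("error of bootstrapped fit vs truth: ", abs(slope_boot - true_slope))
print("gap between the two fits:           ", abs(slope_boot - slope_orig))
```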