Hacker News
by sanxiyn 1158 days ago
I don't think your "synthetic data on ImageNet" reference shows "synthetic data is already very effective". Since many people won't read the paper, here's what it says:

Training ResNet-50 on real ImageNet gives 73.09% top-1 accuracy, while training it on synthetic data (same resolution, same number of images) generated by this work gives 64.96%. That is SOTA for synthetic-only training, up from previous work's 63.02%, but it is still about 8 points below real data. Therefore, synthetic data is worse than real data for now.

But synthetic data is not useless, because training on real data plus synthetic data is a bit better than training on either one alone. (The accuracy numbers here differ from the ones above because the methodology is different.) Mixing real and synthetic data 1:1 improves accuracy from 76.39% to 77.61%. But 1:2 is worse than 1:1 (77.16%), even though the dataset is 50% larger. With 1:4, the result is worse than not using synthetic data at all. So synthetic data can enlarge the dataset by 5x at the very most, and more likely by just 2x.
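
To make the arithmetic behind those ratios explicit, here is a rough sketch in Python. It only uses the figures quoted above; the names (`REAL_ONLY`, `MIXED`, `size_multiplier`, `gain_over_real`) are made up for illustration and do not come from the paper. The 1:4 accuracy is not quoted here, so the script reports only the dataset size for that case.

```python
# Top-1 accuracy figures (percent) as quoted in this comment.
REAL_ONLY = 76.39
# Hypothetical lookup: synthetic parts per 1 part real -> quoted accuracy.
MIXED = {1: 77.61, 2: 77.16}


def size_multiplier(synthetic_parts: int) -> int:
    """Total dataset size relative to real-only for a 1:synthetic_parts mix."""
    return 1 + synthetic_parts


def gain_over_real(synthetic_parts: int) -> float:
    """Accuracy gain in percentage points over the real-only baseline."""
    return round(MIXED[synthetic_parts] - REAL_ONLY, 2)


for parts in (1, 2, 4):
    line = f"1:{parts} -> {size_multiplier(parts)}x data"
    if parts in MIXED:
        line += f", {gain_over_real(parts):+.2f} points vs real-only"
    print(line)
```

Going from 1:1 to 1:2 takes the dataset from 2x to 3x the real-only size, which is where the "50% larger" comes from, and 1:4 is the 5x case.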

1 comment

I wonder how much you can improve that scaling factor by using data augmentation techniques (noise, rescaling, recropping, rotation, changing colors, using normal maps, etc.).