Hacker News
by throwaway4aday 990 days ago
I'd like to point out that it has been shown that text models can be trained on purely synthetic data and perform at or above the level of models trained on human-derived data. This works because you can use an LLM to judge the quality of each generated sample, which lets you automate the process of picking high-quality generations. It won't be long before this is done with generative art as well: a multi-modal model could be used to curate the output of some CC0-derived model and build up a much larger training set for a new model.

You could also procedurally create training data by rendering images of 3D scenes with various shaders applied to give them the look of different art styles. You could use neural style transfer instead of, or in addition to, the shaders to add more styles. The multi-modal model could judge these images as well, selecting only the best. With that, you essentially have a fully automated pipeline for producing a training set of any size that is 100% synthetic except for the base 3D assets, shaders, and example style images, which you could source as CC0 or buy a license for.
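
To make the generate-then-judge loop concrete, here is a rough sketch in Python. Everything in it is an assumption for illustration: `curate`, `toy_generate`, `toy_judge`, and the 0.8 threshold are hypothetical names and values, and the toy functions just stand in for a real generator (a CC0-derived model or a 3D renderer) and a real multi-modal judge.

```python
import random
from typing import Callable, List, TypeVar

T = TypeVar("T")


def curate(
    generate: Callable[[], T],
    judge: Callable[[T], float],
    target_size: int,
    threshold: float = 0.8,
    max_attempts: int = 10_000,
) -> List[T]:
    """Keep generated samples whose judge score meets the threshold.

    Stops once target_size samples are kept or max_attempts is reached,
    so a weak generator cannot make the loop run forever.
    """
    kept: List[T] = []
    for _ in range(max_attempts):
        if len(kept) >= target_size:
            break
        sample = generate()
        if judge(sample) >= threshold:
            kept.append(sample)
    return kept


# Toy stand-ins (hypothetical): a "sample" is just a number in [0, 1)
# and the "judge" returns that number as the quality score.
rng = random.Random(0)


def toy_generate() -> float:
    return rng.random()


def toy_judge(sample: float) -> float:
    return sample


dataset = curate(toy_generate, toy_judge, target_size=5, threshold=0.8)
print(len(dataset), "samples kept")
```

In a real pipeline the threshold controls the trade-off: a higher bar means a cleaner set but more rejected generations per kept sample, so cost scales with how picky the judge is.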