| HN Mirror



	by brianr 1156 days ago
	This analysis misses the impact of AI models being deployed, as is happening rapidly right now. Production applications built on AI will provide ample (infinite?) additional training data to feed back into the underlying models.


haldujai 1156 days ago

I'm not sure that synthetic or LLM-generated training data is as useful as human-generated text.

It seems "good enough" for now, but synthetic data makes up a very small proportion of the training set in the current models that have been trained on it. If that proportion ends up being mostly synthetic, we'll likely see whatever weird hallucinations and biases exist in the dominant backend (GPT-4 or whatever) become amplified.

It's been shown repeatedly that garbage in = garbage out for training data.


brianr 1152 days ago

Agree about synthetic data. My point is that AI-powered applications deployed in production generate more _real_ data, which can be used for training. For example, self-driving cars generate tons of data about how their models perform, as a result of the cars driving around. Similarly, code-writing AI applications will generate feedback in the form of errors, logs, etc., which can be fed back into the models as training data.
