by haldujai
1156 days ago
I'm not sure that synthetic or LLM-generated training data is as useful as human-generated text. It seems "good enough" (for now), but synthetic data makes up a very small proportion of the training set in the current models that have been trained on it. If the training set ends up being mostly synthetic, we'll likely see whatever weird hallucinations and biases exist in the dominant backend (GPT4 or whatever) become amplified. It's been shown repeatedly that garbage in = garbage out for training data.