by HanayamaTriplet
366 days ago
I can understand that data from the years before ChatGPT would not contain any LLM-generated text, but how strongly does the year actually correlate with how much LLM text is in the dataset? Wouldn't special-purpose datasets with varying ratios of human and LLM text be better for testing the effects of "AI contamination"?
This odd finding that newer datasets generally outperform older ones was more of a side effect of having a dataset evaluation system.
If you're trying to examine AI contamination specifically, there are many variables, and trying to capture them all in a laboratory dataset is rather involved.
For one, AI data out in the wild is "enriched": it's very likely to have been selected by users before being published (human feedback, best of 4?), it can gather human interaction such as likes and comments, and it's more likely to get spread around if it's novel, amusing, or high quality than if it's low quality, generic, and bland. How do you replicate that in a lab setup? On a tight budget?