| HN Mirror



	by f3z0 561 days ago
	Given that the top Google results are now generated, I think we already have a massive recursion problem. We would benefit from training a model specifically to estimate the likelihood that a piece of content is generated, and then biasing other models against content with a high likelihood, so that we don’t end up with LLM echo chambers.
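
One way to picture the "bias other models against likely-generated content" idea is as sampling weights over a training corpus. This is only a sketch under assumptions: `p_generated` stands in for a hypothetical detector that returns a probability in [0, 1], and `toy_detector` is a placeholder, not a real detection method.

```python
from typing import Callable, Sequence


def downweight_generated(
    docs: Sequence[str],
    p_generated: Callable[[str], float],
    strength: float = 1.0,
) -> list[float]:
    """Return one sampling weight per document, normalized to sum to 1.

    Documents the (hypothetical) detector scores as likely generated get
    a smaller weight; `strength` controls how sharply they are penalized.
    """
    if not docs:
        return []
    # Clamp detector output to [0, 1], then turn it into a "looks human" weight.
    raw = [(1.0 - min(max(p_generated(d), 0.0), 1.0)) ** strength for d in docs]
    total = sum(raw)
    if total == 0:
        # Everything looks generated: fall back to uniform weights.
        return [1.0 / len(docs)] * len(docs)
    return [w / total for w in raw]


def toy_detector(text: str) -> float:
    # Placeholder only: a real detector would be a trained classifier.
    return 0.9 if "as an ai language model" in text.lower() else 0.1


corpus = ["hello world", "As an AI language model, I cannot do that."]
weights = downweight_generated(corpus, toy_detector)
```

The weights could then feed a weighted sampler when building training batches. Soft weighting rather than hard filtering means an imperfect detector only skews the mix and does not silently drop human-written text.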

2 comments

eddyfromtheblok 561 days ago

Right. Google already has a solution (https://deepmind.google/technologies/synthid/). Everyone insists on training theirs to look human-generated, so the horses have left the stable on this.


tempodox 561 days ago

Isn't everybody always gushing about how LLMs are supposed to get better all the time? If that's true, then detecting generated fluff will be a moving target and an incessant arms race, just like SEO. There is no escape.


LegionMammal978 561 days ago

Yep, that's what I've been thinking since people started talking about it. I hear that AI plagiarism detectors can never work, since LLM output can never be detected with any accuracy. Yet I also hear that LLMs-in-training easily sift out any generated content from their input data, so that recursion is a non-issue. It doesn't make much sense to have it both ways.


ipython 561 days ago

I wonder if the truth about sifting out synthetic training data is based on signals separate from the content itself. Signals such as the source of the data, reported author, links to/from etc.

These signals would be unavailable to a plagiarism/AI detector.
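
A rough sketch of what filtering on such signals might look like, assuming a crawler that records metadata alongside each document. The field names, the flagged lists, and the thresholds are all hypothetical and chosen only for illustration; nothing here is known to match what any lab actually does.

```python
from dataclasses import dataclass


@dataclass
class DocMeta:
    source_domain: str     # where the document was crawled from
    reported_author: str   # byline as reported by the page, may be empty
    inbound_links: int     # number of other pages linking to this one


# Hypothetical lists, for illustration only.
FLAGGED_DOMAINS = {"contentfarm.example"}
FLAGGED_AUTHORS = {"", "admin"}


def looks_synthetic_by_metadata(meta: DocMeta, min_inbound_links: int = 1) -> bool:
    """Flag a document using only metadata, never the text itself."""
    if meta.source_domain.lower() in FLAGGED_DOMAINS:
        return True
    # A missing or generic byline alone is weak evidence; require that
    # nothing links to the page as well before flagging it.
    anonymous = meta.reported_author.strip().lower() in FLAGGED_AUTHORS
    return anonymous and meta.inbound_links < min_inbound_links
```

A detector that is handed only a pasted essay has none of these fields to work with, which is the asymmetry described above.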
