Hacker News
by paulmd 1299 days ago
You ultimately can't, and there are certainly degrees of "organicness" even among organic content - a lot of content is essentially infomercials or arguments shilling a perspective the author has a financial interest in. And of course there's the case of the Wikipedia editor who made up like 75% of the Scots Wikipedia articles, which have been training inputs for language translation models etc. That is very organic content, but it is also actual poison to train on!

The good news is the internet is relatively good at routing around the shit, for now. And I guess de facto that is something you could apply to your content inputs: what's the PageRank for this content? Actual PageRank, meaning rank derived from who links to it, not the advertising/engagement bullshit that the search model has turned into. If the AI-generated stuff is correct enough that it has a high PageRank, maybe it's correct enough to be used as an input.
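To make the "filter inputs by PageRank" idea concrete, here's a rough sketch. It runs textbook power-iteration PageRank on a toy link graph and keeps only pages above a cutoff. The function names, the toy graph, and the 0.1 threshold are all made up for illustration; this is not how any real search engine or training pipeline does it.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets the "random jump" share up front.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly across its outbound links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page (no outbound links): spread its rank evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


def filter_by_rank(links, min_rank):
    """Return the pages whose PageRank is at least min_rank (hypothetical cutoff)."""
    ranks = pagerank(links)
    return sorted(p for p, r in ranks.items() if r >= min_rank)


# Toy graph (invented): the generated pages link out, but nothing links to them.
toy_links = {
    "reference": ["tutorial", "blog"],
    "tutorial": ["reference"],
    "blog": ["reference"],
    "generated_a": ["reference"],
    "generated_b": ["reference"],
}
print(filter_by_rank(toy_links, min_rank=0.1))
```

Pages nobody links to never climb above the baseline random-jump share, so they fall under the cutoff. The obvious weakness is the same one search has: link farms can inflate rank, so this only works while inbound links stay a decent proxy for "someone vouches for this".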

But the thing is, there's already been an uptick in ML- or AI-generated content surfacing in searches and other places, and it's not always correct... and honestly the relevance of Google's search results has been noticeably decaying for 10+ years now. Things I know are out there and are relevant are not being surfaced anymore. Is AI generation contributing to that problem? Maybe. Probably not helping, at least.