| HN Mirror

	by marmarama 1576 days ago
	If GPT-3 can produce procedurally generated web content this convincing, search engines are screwed, right? We won't be able to find anything useful on any current search engine because there's no straightforward algorithmic way to tell useful content from endless link farms full of utterly convincing but totally useless content.

11 comments

jerf 1576 days ago

Yes, I think we're still a couple of years from this becoming an intractable problem, but it's absolutely coming.

Startup entrepreneurs in the mood for a Hail Mary play take note. How do you have a web search engine in a world where there no longer exists any algorithm for telling spam apart from real content? "Go back to the original Yahoo" is a decent start but certainly nowhere near a complete answer in 2022!

My guess is that it may not even take the form of what we have today, with an arbitrary text box. Maybe you have to go down to a specific category at least. Who knows. I sure don't. All I can say is that it sure looks to me like the spammers are only a year or two from effective total victory in the current paradigm.

omegalulw 1575 days ago

Just because people cannot tell generated content from natural content doesn't mean ML classifiers can't. Training GPT-3 to recognize GPT-3 is a lot simpler than you think (more specifically, we don't have a good way of sampling from the long tail when generating, which a model like GPT-3 can pick up on in a jiffy), especially since the vast majority of people won't be able to fine-tune the model enough to diverge from base statistics. Throw in other features like domain trustworthiness, user click rate, etc., and search engines should remain fairly reliable for mainstream searches. If you are searching for niche content, though, yes, there could be degradation.

jerf 1575 days ago

Spammers aren't going to just "fire GPT-3" at the problem and then quail in panic when it doesn't quite work, any more than they have with any other technique.

The problem is the space between "AI generated" and "human generated" is fundamentally getting smaller. That's a real problem. We don't seem to be that many steps away from that space going to zero for generalized writing.

heuroci 1573 days ago

Reporting for duty...

sdrinf 1576 days ago

The floodgates are currently held back only by OpenAI's production go-live policies, which (currently) 1, require a manual review with an interview; and 2, specifically forbid use cases that can be used for spamming at scale: https://beta.openai.com/docs/usage-guidelines/use-case-guide...

I'm not expecting this to hold for very long (OpenAI has partnered with Microsoft, and there are at least 2 groups currently working on replicating it open-source), and I expect strong deterioration of overall web content soon afterwards.

smrtinsert 1576 days ago

I'd happily pay 5 bucks a month for a search engine searching only a curated list of sites.

Under such a system, any company that begins producing spam could be removed, and we could go back to the lovely days of something simple like PageRank being used to provide relevant results.
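To make the idea concrete, here is a minimal sketch of a PageRank-style power iteration restricted to a curated list. The function name, the `allowlist` parameter, and the example domains are all hypothetical, and this is only an illustration of the general technique, not how any real search engine ranks pages.

```python
def pagerank(links, allowlist, damping=0.85, iterations=50):
    """Power-iteration PageRank over a curated set of sites.

    links: dict mapping a site to the list of sites it links to
           (hypothetical crawl data).
    allowlist: set of curated sites; anything else is ignored entirely.
    """
    pages = sorted(allowlist)
    if not pages:
        return {}
    # Keep only links that stay inside the curated set.
    out = {p: [q for q in links.get(p, []) if q in allowlist] for p in pages}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by sites with no outgoing links is spread evenly.
        dangling = sum(rank[p] for p in pages if not out[p])
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            for q in out[p]:
                new_rank[q] += damping * rank[p] / len(out[p])
        rank = new_rank
    return rank


# Example: the spam site links only to itself and is not on the curated list.
links = {
    "a.example": ["b.example", "spam.example"],
    "b.example": ["a.example"],
    "spam.example": ["spam.example"],
}
ranks = pagerank(links, allowlist={"a.example", "b.example"})
```

Removing a site from `allowlist` drops both its own rank and any rank it would have passed on, which is the "could be removed" step described above.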

TechBro8615 1576 days ago

I would pay for this service too, but only if the list was personal to me, and I could add or remove sites from it.

It would also be cool if I could upload my own crawling modules, so I could index more than just websites.

Workaccount2 1576 days ago

In the story of the Library of Babel, the librarians live in despair because, despite having access to all the world's information, they also have access to all the world's disinformation, and with it all mixed together there is no way to tell which is which.

randomsilence 1575 days ago

An implementation can be found here: https://libraryofbabel.info/

joken0x 1576 days ago

Very good reference. It is a problem that will remain.

fudged71 1576 days ago

I'm now convinced that Google will show artificial results as an amalgamation of the other results and pocket the ad views for themselves. It's the logical conclusion, isn't it? The question is how they would distinguish those results in search.

kingcharles 1576 days ago

It makes perfect sense. I guess the crux of Google Search is to give you an answer to a question. Do you care who gives you the answer as long as it is right?

fudged71 1575 days ago

I suppose the knowledge graph is a step towards this. But people will always want to read more about a subject and share a link to other people, so this seems like the logical next step!

moffkalast 1576 days ago

At least we can still use Google to search Reddit.

dividuum 1576 days ago

You mean this reddit? https://old.reddit.com/r/SubredditSimulator/

marstall 1576 days ago

total gibberish?

mgdlbp 1576 days ago

The original uses Markov chains; it was usurped a couple of years ago by https://old.reddit.com/r/SubSimulatorGPT2/
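For anyone wondering why Markov-chain output reads as gibberish, a minimal word-level sketch follows. It is a generic illustration of the technique, not the subreddit's actual code; the function names and the toy corpus are made up.

```python
import random
from collections import defaultdict


def build_chain(text, order=1):
    """Map each run of `order` words to the words seen right after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state].append(words[i + order])
    return dict(chain)


def generate(chain, length=20, seed=None):
    """Walk the chain, picking each next word at random from the followers."""
    if not chain:
        return ""
    rng = random.Random(seed)
    state = rng.choice(sorted(chain))
    output = list(state)
    while len(output) < length:
        followers = chain.get(state)
        if not followers:
            break  # Dead end: this state was only seen at the end of the corpus.
        next_word = rng.choice(followers)
        output.append(next_word)
        state = state[1:] + (next_word,)
    return " ".join(output)


chain = build_chain("the cat sat on the mat")
sample = generate(chain, length=10, seed=0)
```

Each word depends only on the previous `order` words, so the text is locally plausible but has no longer-range coherence, which is the gap the GPT-2 version closes.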

jay00 1576 days ago

Until all Reddit posts are GPT-3 generated.

kelseyfrog 1576 days ago

It should be possible to train an upvote prediction model conditioned on submission title. This could then be used to optimize GPT-3-family models to produce text with the highest predicted upvote response. It's a couple-weekend project and I'd be surprised if an AI hobbyist hadn't done it already.

jazzyjackson 1576 days ago

In the trivial case, karma-farming bots just keep a database of all Reddit history (it is a public dataset, a few hundred gigabytes) and repost the top comments (top threads even) whenever they detect a reposted link (extra points for similarity / reverse image searching).

It’s a project I have on the back burner to analyze Reddit history to check what ratio of comments are actually original, and I’d like to build a link aggregator that sorts by novelty.
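A rough sketch of the originality ratio, assuming the comments arrive in chronological order as plain strings. It only catches exact reposts after light normalization (case and punctuation), so near-duplicates and paraphrases would still count as original; the function names are hypothetical.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse anything that is not a letter or digit."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def originality_ratio(comments):
    """Fraction of comments whose normalized text has not been seen before."""
    seen = set()
    original = 0
    for text in comments:
        digest = hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            original += 1
    return original / len(comments) if comments else 0.0


comments = [
    "This is the way.",
    "this is the way",
    "Something new",
    "THIS IS THE WAY!!!",
]
ratio = originality_ratio(comments)  # 2 of the 4 comments are first occurrences
```

Storing digests instead of full text keeps memory manageable for a dataset of the size mentioned above; sorting by novelty would need a fuzzier similarity measure than an exact hash.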

kingcharles 1576 days ago

I've thought about this too, and the fact that I've not seen such a bot so far is pretty unbelievable. It's not a huge amount of work to code it. Working across the whole of Reddit (or HN for that matter), it would gather an ungodly amount of karma (and awards) in a small amount of time.

moffkalast 1576 days ago

Everyone on Reddit is a bot except you.

randomsilence 1575 days ago

That must be Meta's long game with their social network. It's much easier to identify signals if you know who is sending.

joken0x 1576 days ago

Yes, I think the same. Search engines try to match previously created content (which is finite, so that will always be a limitation) with our query or need to know, while AI can generate an answer and adjust it to what we need or want to know, and even to our purpose, intellectual level, etc. Basically, tailored responses.

da39a3ee 1575 days ago

> utterly convincing but totally useless content.

Whether it’s convincing depends on your background. Those in society who can distinguish reality from fiction will still be rewarded, for obvious reasons. So that’s a difference from our current world in degree, not in kind.

skybrian 1576 days ago

When you say "this convincing", what are you basing it on?

dqpb 1576 days ago

That is true of human generated content as well, so I think that makes this a good thing in the long run.