by marmarama 1576 days ago
If GPT-3 can produce procedurally generated web content this convincing, search engines are screwed, right? We won't be able to find anything useful on any current search engine because there's no straightforward algorithmic way to tell useful content from endless link farms full of utterly convincing but totally useless content.
11 comments

Yes, I think we're still a couple of years from this becoming an intractable problem, but it's absolutely coming.

Startup entrepreneurs in the mood for a Hail Mary play take note. How do you have a web search engine in a world where there no longer exists any algorithm for telling spam apart from real content? "Go back to the original Yahoo" is a decent start but certainly nowhere near a complete answer in 2022!

My guess is that it may not even take the form of what we have today, with an arbitrary text box. Maybe you have to go down to a specific category at least. Who knows. I sure don't. All I can say is that it sure looks to me like the spammers are only a year or two from effective total victory in the current paradigm.

Just because people cannot tell generated content from natural content doesn't mean ML classifiers can't. Training GPT-3 to recognize GPT-3 is a lot simpler than you think (more specifically, we don't have a good way of sampling from the long tail when generating, which a model like GPT-3 can pick up on in a jiffy), especially since the vast majority of people won't be able to fine-tune the model enough to diverge from base statistics. Throw in other features like domain trustworthiness, user click rate, etc., and search engines should remain fairly reliable for mainstream searches. If you are searching for niche content, though, yes, there could be degradation.
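To make the "throw in other features" idea concrete, here is a rough sketch of blending a generated-text score with domain trust and click rate. Everything in it is an assumption for illustration: the `Page` fields, the weights, and the idea that a classifier hands you a `p_generated` probability. It is not how any real search engine ranks.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    relevance: float     # query-match score in [0, 1]
    p_generated: float   # hypothetical classifier output: probability the text is machine-generated
    domain_trust: float  # hypothetical trust score in [0, 1]
    click_rate: float    # observed click-through rate in [0, 1]


def rank_score(page: Page) -> float:
    """Blend relevance with quality signals; the weights are made up for illustration."""
    quality = (
        0.5 * (1.0 - page.p_generated)
        + 0.3 * page.domain_trust
        + 0.2 * page.click_rate
    )
    return page.relevance * quality


def rank(pages):
    """Return pages sorted from highest to lowest blended score."""
    return sorted(pages, key=rank_score, reverse=True)
```

With these made-up weights, a highly relevant page that looks generated and sits on an untrusted domain still sinks below a slightly less relevant page from a trusted one.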
Spammers aren't going to just "fire GPT-3" at the problem and then quail in panic when it doesn't quite work, any more than they have with any other technique.

The problem is the space between "AI generated" and "human generated" is fundamentally getting smaller. That's a real problem. We don't seem to be that many steps away from that space going to zero for generalized writing.

Reporting for duty...
The floodgates are currently held back only by OpenAI's production go-live policies, which (for now) 1, require manual review with an interview; and 2, specifically forbid use cases that can be used for spamming at scale: https://beta.openai.com/docs/usage-guidelines/use-case-guide...

I'm not expecting this to hold for very long (OpenAI has partnered with Microsoft, and there are at least 2 groups currently working on replicating it open-source), and I expect strong deterioration of overall web content soon afterwards.

I'd happily pay 5 bucks a month for a search engine searching only a curated list of sites.

Under such a system any company that begins producing spam could be removed, and we could go back to the lovely days of something simple like PageRank being used to provide relevant results.
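For anyone curious what "something simple" looks like over a curated list, below is a minimal power-iteration sketch of the textbook algorithm. The `links` dict format, the damping factor of 0.85, and the fixed iteration count are my assumptions, and links pointing outside the curated list are simply ignored.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {site: [sites it links to]}.

    Links pointing outside the curated list are ignored.
    """
    sites = list(links)
    n = len(sites)
    rank = {s: 1.0 / n for s in sites}
    for _ in range(iterations):
        new_rank = {s: (1.0 - damping) / n for s in sites}
        for src in sites:
            targets = [t for t in links[src] if t in links]
            if targets:
                share = damping * rank[src] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling site: spread its rank evenly over the whole list.
                share = damping * rank[src] / n
                for t in sites:
                    new_rank[t] += share
        rank = new_rank
    return rank
```

Removing a spammy site is then just deleting its key from `links` and recomputing; a site that nobody on the list links to ends up with the minimum rank anyway.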

I would pay for this service too, but only if the list was personal to me, and I could add or remove sites from it.

It would also be cool if I could upload my own crawling modules, so I could index more than just websites.

In the story of the Library of Babel, the librarians live in despair because despite having access to all the world's information, they also have access to all the world's disinformation, and with it all mixed together there is no way to tell which is which.
An implementation can be found there: https://libraryofbabel.info/
Very good reference. It is a problem that will remain.
I'm now convinced that Google will show artificial results as an amalgamation of the other results and pocket the ad views for themselves. It's the logical conclusion, isn't it? The question is how they would distinguish those results in search.
It makes perfect sense. I guess the crux of Google Search is to give you an answer to a question. Do you care who gives you the answer as long as it is right?
I suppose the knowledge graph is a step towards this. But people will always want to read more about a subject and share a link to other people, so this seems like the logical next step!
At least we can still use Google to search Reddit.
total gibberish?
The original uses Markov chains; it was usurped a couple of years ago by https://old.reddit.com/r/SubSimulatorGPT2/
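For context on why the Markov chain output reads like gibberish, here is a generic word-level sketch of the technique. It is not the actual bot's code; the function names and the first-order (one word of context) design are my assumptions.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def generate(chain, start, length, rng):
    """Walk the chain from `start`, stopping early at a word with no successor."""
    output = [start]
    while len(output) < length and chain.get(output[-1]):
        output.append(rng.choice(chain[output[-1]]))
    return " ".join(output)
```

Each word is picked by looking only at the one before it, so sentences are locally plausible but drift with no overall meaning. Usage would be something like `generate(build_chain(corpus), "the", 20, random.Random())`.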
Until all Reddit posts are GPT-3 generated.
It should be possible to train an upvote prediction model conditioned on submission title. This could then be used to optimize GPT-3-family models to produce text with the highest predicted upvote response. It's a couple-weekend project and I'd be surprised if an AI hobbyist hadn't done it already.
In the trivial case, karma farming bots just keep a database of all Reddit history (it is a public dataset, a few hundred gigabytes) and repost the top comments (top threads even) whenever they detect a reposted link (extra points for similarity / reverse image searching).

It’s a project I have on the back burner to analyze Reddit history to check what ratio of comments are actually original, and I’d like to build a link aggregator that sorts by novelty.
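A first pass at the originality ratio could be as crude as the sketch below. It assumes the comments are already loaded as a chronologically ordered list of strings, and it only catches exact reposts after normalizing case and punctuation; near-duplicates would need something fuzzier. The function names are just illustrative.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Hash a comment after lowercasing and collapsing punctuation and whitespace."""
    normalized = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def original_ratio(comments):
    """Return the fraction of comments whose fingerprint was not seen earlier.

    `comments` must be in chronological order.
    """
    seen = set()
    originals = 0
    total = 0
    for text in comments:
        total += 1
        fp = fingerprint(text)
        if fp not in seen:
            seen.add(fp)
            originals += 1
    return originals / total if total else 0.0
```

The same fingerprints could feed a novelty sort: score each submission by how few of its comments were seen before.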

I've thought about this too, and the fact that I've not seen such a bot so far is pretty unbelievable. It's not a huge amount of work to code it. Working across the whole of Reddit (or HN for that matter), it would gather an ungodly amount of karma (and awards) in a small amount of time.
Everyone on Reddit is a bot except you.
That must be Meta's long game with their social network. It's much easier to identify signals if you know who is sending.
Yes, I think the same. Search engines try to match our query against previously created content, which is finite and so will always be a limitation, while AI can generate an answer and adjust it to what we need or want to know, even to our purpose, intellectual level, etc. Basically, tailored responses.
> utterly convincing but totally useless content.

Whether it’s convincing depends on your background. Those in society who can distinguish reality from fiction will still be rewarded, for obvious reasons. So that’s a difference from our current world in degree, not in kind.

When you say "this convincing", what are you basing it on?
That is true of human generated content as well, so I think that makes this a good thing in the long run.