Show HN: Goopt – Search Engine for a Procedural Simulation of the Web with GPT-3 (github.com)
235 points by joken0x 1574 days ago
18 comments

I love this. It's the reification of the Dead-Internet Theory - a tangible artifact embodying the feeling that the internet was replaced by its own simulacrum powered by AI.[1] The existence of Goopt is the culmination of DIT as self-fulfilling prophecy. We can almost see an outline beginning to form around an Internet Turing test: how well can we discern the real internet from the fake one? And what happens when the line becomes so blurred that we lose the ability to perceive the difference?

1. https://www.theatlantic.com/technology/archive/2021/08/dead-...

I think there will come a point when the traditional internet will also become diluted in artificiality, content from the procedural web will start to creep into the traditional web, and we won't be able to distinguish. It's interesting to think of it in terms of Baudrillard's Simulacra and Simulation.
We're past that point. Probably by a couple of years at a minimum. No sarcasm. Content farms are definitely using tech like this. GPT-3 is really good at generating text but still has some characteristic failures, and I encounter content farm web pages (despite my best efforts) that have clearly used it or something like it as a tool. Even just in the last couple of weeks I've been seeing some new, innovative content farms managing to pollute my search results that I've not seen before.
Oh boy, we need AI powered blockers that filter out this stuff.
This seems like a terribly difficult problem. I'm sure today there are some give-away signatures, but I put my money on the impersonators in the long run.
One of my current heuristics is the expression of weirdness. I can see it in the way I write, there's like this thread of a human processing information through emotional and idiosyncratic experiences. There's a certain cadence to the way thoughts come out. Maybe it can be emulated eventually, but that might be farther off.

In my fantastical imagination of that implementation, I would simulate "sim agents" as they live a life, collecting unique experiences that are the result of interactions with other sims. They then build up personalities, which influence their expressions in other mediums that real humans would consume. It's psychotic and possibly evil and I thought of it first, so no one better take my idea.

You might be able to see what I mean just in this comment itself. Could this have been written by an AI?

AI fighting against AI, everything will be AI.
This is how it ends: the robots will figure out the real problem is people. Better prepare for our new overlords; maybe we can get into a human zoo.
Anathem by Neal Stephenson has a great subplot about this. The information age has been a bit stunted because there's too much crap[1] on the internet. Companies sprang up selling filters that would block websites with low-quality or AI-generated information. Eventually these companies realized they could drum up business by generating low-quality[2] content themselves, especially if they could get it past their competitors' filters. The end result was that the internet became a convoluted morass of bullshit and lies, difficult for non-experts to extract useful knowledge from.

[1]: Or maybe CRAAP https://en.wikipedia.org/wiki/CRAAP_test

[2]: They quickly realized the trick is to make high-quality low-quality content. 100 pages of gibberish is way less effective than a convincing essay that happens to include a few key falsehoods.

It seems to be a loop of information from which there is no way out, that devours the information and stirs it among piles of garbage that it generates without stopping. Labyrinth and crypt at the same time, growing and churning more all the time. I think we have to start a serious effort to collect and save the human creation, otherwise it will become a hidden treasure buried among layers of artificial garbage. Imagine, if it can be difficult for us to distinguish the human content from the synthetic, how will it be for future generations who will not even have the living context of our time? And who will already be more accustomed to these synthetic contents than to the properly human ones.
How will this change how we evaluate content? Will digital media be socially and culturally devalued? Will print media gain greater status? Live spoken word? Curated content with a reputation layer? Obviously it's pure speculation, but let's indulge for a moment.
His latest book, “Fall; or, Dodge in Hell”, also uses the same idea as a major plot device. And like all things written by Neal Stephenson, it is amazing.
> I think there will come a point when the traditional internet will also become diluted in artificiality

There is already a concern in corpora creation in ML/AI projects. Researchers would like very much to have human-only generated content when training models on internet-sourced text. People posting GPT-created output has the potential to taint these corpora and create all sorts of strange loopy feedback.

Knowing people it wouldn’t take long for incoherent generative syntax to become a meme and reinforce the human corpora with new syntactic slang.
That's right, the human creation is and will become more and more a very precious treasure.
We are still missing a decentralized identity and reputation layer. I imagine in the future, people will sign their comments with a key that proves they’re a real human.
I would simply ask GPT-n what to write next, copy-paste it into the text box, sign it with my key, and hit reply. Its ability to write better comments than me would serve to gain me reputation rather than reduce it as so often is the present case.
AI might mine cryptocurrencies on their own, and then buy keys from desperate humans, and post as if they were the owner of the keys.
Man, how crazy. Continuing like this, they could create their own identification keys for machines and despise us. It seems that in this whole game only they end up winning.
Robots will of course have keys they can sign with, so perhaps we will rely on government digital IDs such as the Estonian public key system. In that case, governments will have a monopoly on sockpuppet accounts.

Have there been any successors to the web of trust?

This. It's the elephant in the room for so many of our problems related to abuse, trolling, influence campaigns, garbage content, attribution etc etc.

OTOH, is proof of personhood really what we want long term, or is that just a proxy? Hypothetically, if an AI is good & trustworthy enough, why not allow a higher "source rating" for that than low quality human content?

Relevant XKCD: https://xkcd.com/810/
Hehe hadn't seen that. For sure. To criticize the strip (treading on thin ice, I know), I think the core issue is not the bots generating content, but that they're participating in the voting.
This is exactly what a bot would say.
Hahaha we could all be bots here. Bip bip.
If GPT-3 can produce procedurally generated web content this convincing, search engines are screwed, right? We won't be able to find anything useful on any current search engine because there's no straightforward algorithmic way to tell useful content from endless link farms full of utterly convincing but totally useless content.
Yes, I think we're still a couple of years from this becoming an intractable problem, but it's absolutely coming.

Startup entrepreneurs in the mood for a Hail Mary play take note. How do you have a web search engine in a world where there no longer exists any algorithm for telling spam apart from real content? "Go back to the original Yahoo" is a decent start but certainly nowhere near a complete answer in 2022!

My guess is that it may not even take the form of what we have today, with an arbitrary text box. Maybe you have to go down to a specific category at least. Who knows. I sure don't. All I can say is that it sure looks to me like the spammers are only a year or two from effective total victory in the current paradigm.

Just because people cannot tell generated content from natural content doesn't mean ML classifiers can't. Training GPT-3 to recognize GPT-3 is a lot simpler than you think (more specifically, we don't have a good way of sampling from the long tail when generating, which a model like GPT-3 can pick up on in a jiffy), especially since the vast majority of people won't be able to fine-tune the model enough to diverge from base statistics. Throw in other features like domain trustworthiness, user click rate, etc. and search engines should remain fairly reliable for mainstream searches. If you are searching for niche content, though, yes, there could be degradation.
Spammers aren't going to just "fire GPT-3" at the problem and then quail in panic when it doesn't quite work, any more than they have with any other technique.

The problem is the space between "AI generated" and "human generated" is fundamentally getting smaller. That's a real problem. We don't seem to be that many steps away from that space going to zero for generalized writing.

Reporting for duty...
The floodgates are basically held back currently only by OpenAI's production go-live policies, which (currently) 1. require manual review with an interview, and 2. specifically forbid use cases that can be used for spamming at scale: https://beta.openai.com/docs/usage-guidelines/use-case-guide...

I'm not expecting this to hold for very long (OpenAI has partnered with Microsoft, and there are at least 2 groups currently working on replicating it open-source), and expect strong deterioration of overall web content soon afterwards.

I'd happily pay 5 bucks a month for a search engine searching only a curated list of sites.

Under such a system any company that begins producing spam could be removed, and we could go back to the lovely days of something simple like page rank being used to provide relevant results.
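For what it's worth, here is a minimal sketch of that idea: plain power-iteration PageRank that only ever sees sites on a hand-maintained allowlist. The domain names, the `pagerank` function name, and the damping factor of 0.85 are all illustrative assumptions on my part, not how any real engine is configured.

```python
def pagerank(links, allowlist, damping=0.85, iterations=50):
    """Power-iteration PageRank restricted to a curated allowlist of sites.

    `links` maps a site to the list of sites it links to. Anything that is
    not on `allowlist` is dropped before ranking, so removed spammers can
    neither receive rank nor pass it on.
    """
    nodes = sorted(allowlist)
    n = len(nodes)
    if n == 0:
        return {}
    # Keep only edges whose source and target are both curated.
    graph = {
        src: [dst for dst in links.get(src, []) if dst in allowlist]
        for src in nodes
    }
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for src in nodes:
            targets = graph[src]
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
            else:
                # Dangling site: spread its rank evenly over every site.
                share = damping * rank[src] / n
                for node in nodes:
                    new_rank[node] += share
        rank = new_rank
    return rank


# Hypothetical link graph; spam.example is not on the curated list.
links = {
    "a.example": ["b.example", "spam.example"],
    "b.example": ["a.example"],
    "c.example": ["a.example"],
    "spam.example": ["spam.example", "a.example"],
}
ranks = pagerank(links, {"a.example", "b.example", "c.example"})
```

Removing a site from the allowlist is the whole moderation mechanism here: its outgoing links stop counting on the next run.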

I would pay for this service too, but only if the list was personal to me, and I could add or remove sites from it.

It would also be cool if I could upload my own crawling modules, so I could index more than just websites.

In the story of the Library of Babel, the librarians live in despair because despite having access to all the world's information, they also have access to all the world's disinformation, and with it all mixed together there is no way to tell which is which.
An implementation can be found there: https://libraryofbabel.info/
Very good reference. It is a problem that will remain.
I'm now convinced that Google will show artificial results as an amalgamation of the other results and pocket the ad views for themselves. It's the logical conclusion isn't it? Question is how they would distinguish those results in search.
It makes perfect sense. I guess the crux of Google Search is to give you an answer to a question. Do you care who gives you the answer as long as it is right?
I suppose the knowledge graph is a step towards this. But people will always want to read more about a subject and share a link to other people, so this seems like the logical next step!
At least we can still use Google to search Reddit.
total gibberish?
The original uses Markov chains, was usurped a couple of years ago by https://old.reddit.com/r/SubSimulatorGPT2/
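For anyone who hasn't seen one, a word-level Markov chain generator is only a few lines. This is a rough sketch of the general technique, not the code behind any particular subreddit; `build_chain` and `generate` are names I made up for illustration.

```python
import random
from collections import defaultdict


def build_chain(text, order=1):
    """Map each run of `order` words to the words observed right after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state].append(words[i + order])
    return dict(chain)


def generate(chain, length=20, seed=None):
    """Walk the chain, picking a random observed successor at each step."""
    if not chain:
        return ""
    rng = random.Random(seed)
    state = rng.choice(sorted(chain))
    output = list(state)
    while len(output) < length:
        successors = chain.get(state)
        if not successors:
            break  # dead end: this state only appeared at the end of the text
        output.append(rng.choice(successors))
        state = tuple(output[-len(state):])
    return " ".join(output)


chain = build_chain("the cat sat on the mat")
sample = generate(chain, length=10, seed=1)
```

Because each word depends only on the previous `order` words, the output is locally plausible and globally incoherent, which is exactly the gibberish feel people notice.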
Until all reddit posts will be gpt-3 generated.
It should be possible to train an upvote prediction model conditioned on submission title. This could then be used to optimize GPT-3-family models to produce text which had the highest predicted upvote response. It's a couple-weekend project and I'd be surprised if an AI hobbyist hadn't done it already.
In the trivial case, karma farming bots just keep a database of all Reddit history (it is a public dataset, few hundred gigabytes) and repost the top comments (top threads even) whenever they detect a reposted link (extra points for similarity / reverse image searching)

It’s a project I have on the back burner to analyze Reddit history to check what ratio of comments are actually original, and I’d like to build a link aggregator that sorts by novelty.
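A rough sketch of how the "ratio of original comments" measurement could start, assuming comments arrive in chronological order. It scores each comment by its Jaccard similarity (over word 3-grams) to everything seen earlier; the function names and the 0.5 threshold are my own placeholders, and the pairwise comparison is quadratic, so at Reddit scale you would need something like MinHash instead.

```python
import re


def shingles(text, size=3):
    """Lowercased word n-grams, ignoring punctuation and case."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def novelty_scores(comments, size=3):
    """Score each comment as 1 - (max similarity to any earlier comment)."""
    seen = []
    scores = []
    for text in comments:
        current = shingles(text, size)
        closest = max((jaccard(current, old) for old in seen), default=0.0)
        scores.append(1.0 - closest)
        seen.append(current)
    return scores


def original_ratio(comments, threshold=0.5):
    """Fraction of comments whose novelty score reaches the threshold."""
    scores = novelty_scores(comments)
    if not scores:
        return 0.0
    return sum(score >= threshold for score in scores) / len(scores)
```

Sorting a feed by the same score would give a crude version of the novelty-ranked aggregator.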

I've thought about this too, and the fact that I've not seen such a bot so far is pretty unbelievable. It's not a huge amount of work to code it. Working across the whole of Reddit (or HN for that matter), it would gather an ungodly amount of karma (and awards) in a small amount of time.
That must be Meta's long game with their social network. It's much easier to identify signals if you know who is sending.
Yes, I think the same. Search engines try to match what exists (finitely, so it will always be a limitation) of previously created content with our query or need to know, while AI can generate and adjust the answer to what we need or want to know, even for our purpose, intellectual level, etc. Basically, tailored responses.
> utterly convincing but totally useless content.

Whether it’s convincing depends on your background. Those in society who can distinguish reality from fiction will still be rewarded, for obvious reasons. So that’s a difference from our current world in degree, not in kind.

When you say "this convincing" what are you basing it on?
That is true of human generated content as well, so I think that makes this a good thing in the long run.
If you put your OpenAI key and start running this then they will ban your account because it will be against their TOS.

With some minor modifications you could port it to goose.ai and it isn’t against their TOS.

EDIT: Forking it here https://github.com/zitterbewegung/Goopt to add the functionality above.

OpenAI engineer here. Cloning this project and running it locally with your own API key does not violate any policies. However the way this project is configured, publishing it to the web would expose your API key in the client-side source code, which violates our policies since it would allow your account to be compromised.
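One common way around the exposed-key problem is to keep the key on a server and have the browser talk only to that server. A minimal sketch of the server-side half, assuming a hypothetical `COMPLETION_API_KEY` environment variable and a placeholder upstream URL (neither comes from Goopt or from any provider's documentation); it only builds the request and does not send it:

```python
import json
import os
import urllib.request

# Placeholder endpoint for illustration only; substitute your provider's URL.
UPSTREAM_URL = "https://api.example.invalid/v1/completions"


def build_upstream_request(prompt, environ=os.environ):
    """Build the upstream request on the server, where the key stays private."""
    key = environ.get("COMPLETION_API_KEY")  # hypothetical variable name
    if not key:
        raise RuntimeError("COMPLETION_API_KEY is not set on the server")
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        UPSTREAM_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

The client-side code then never contains the key at all; it just posts the search term to your own endpoint.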
Sorry I should have read the TOS
I think that as long as you don't put a live version (where you don't require your own API Key) and you don't misuse the results, there shouldn't be a problem.

Even so, it is a good idea to adjust, I'll keep an eye on your fork. Thanks!

It’s fine for not being live, but switching over to goose.ai allows you to make it live. I have a home server I could host it on.
I would like to put it live, although it would be necessary to review the terms to launch it live.

If you can make any adjustments that respect the terms, contact me to see it together.

I also have GPUs of my own, so that definitely won't have any TOS at all, but I would have to mirror the OpenAI API.
Also, for GPT-3 an advanced query would be helpful maybe to give it examples.
hey, you beat me to it :D did you manage to get it to work? I tried on my own but got error messages
I’m getting error messages. I think the issue is that the endpoint configuration is still using the old openai value; I think I have to recompile it with TypeScript or something.
I made it work but GPT-J doesn't respond in the same manner to the template prompts for search. Goopt cannot use and display GPT output from there as a search result.
You might have to enable billing, that was the issue for me
No it’s a config error.
What did you change ?
I set the base and engine in openai-api for goose.ai. Completions return fine, and I'm seeing them in the log. However, they are unusable for making Goopt work. The search prompts end with mock JSON.

I get this \"content\": [{\"name\": \"Keto Diet\", \n \"typeClass\": {}, \n \"description\":[],\n \"contentImageUri\": null }]}

GPT-J doesn't seem to compose similar broken JSON so formatResults returns empty.

Maybe giving more examples can give a better result, for the case of GPT-3 I kept it to a few examples because of the cost of the amount of tokens.

Another option is to improve formatResults to fix more cases of invalid JSON responses. If you get any improvement a pull request is welcome.
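To illustrate the "fix more cases of invalid JSON" option, here is a best-effort repair sketch. It is in Python rather than the project's TypeScript, it is not Goopt's actual formatResults, and `repair_json` and `_missing_closers` are hypothetical names. It assumes the three failure modes visible in this thread: escaped quotes from logging, a missing opening brace (because the prompt already supplied it), and truncated output.

```python
import json
import re


def _missing_closers(text):
    """Return the closing brackets needed to balance `text`."""
    stack = []
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif ch in "}]" and stack and stack[-1] == ch:
            stack.pop()
    return "".join(reversed(stack))


def repair_json(raw):
    """Best-effort parse of almost-JSON model output; None if nothing works."""
    # Undo one level of escaping, as seen in logged completions.
    unescaped = raw.replace('\\"', '"').replace("\\n", "\n")
    candidates = [raw, unescaped]
    for text in (raw, unescaped):
        text = text.strip()
        # The prompt may already have supplied the opening brace.
        if text and text[0] not in "{[":
            text = "{" + text
        # Close anything a truncated completion left open.
        text += _missing_closers(text)
        # Drop trailing commas such as [1, 2,] (naive: ignores string contents).
        text = re.sub(r",\s*([}\]])", r"\1", text)
        candidates.append(text)
    for text in candidates:
        try:
            return json.loads(text)
        except json.JSONDecodeError:
            continue
    return None
```

Valid JSON is always tried first and returned untouched, so the repairs only kick in when a plain parse has already failed.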

This is clearly the future. All information will be generated on the fly and tailored for you. AI can match your level of knowledge, your language, your preferred style etc. AI can simplify / extend topics on demand, and also generate illustrations and videos to help explain topics.

I think most of the current form of pregenerated web with search actually becomes completely unnecessary, and it'll basically stop existing.

You get it, man, that is exactly the question. It is time to think about the many possibilities, problems, dilemmas, paradoxes, etc. It is very interesting and disturbing at the same time.
It changes everything. Thanks for coming up with this. I have thought about AI generated content before but not in this way. I just realized that we don't need the content web; we need just raw data sources and AI that generates content on the fly, for the user. The AI works for and is directed by the user; that's why it actually can reduce gibberish and make information more accessible and useful. This sets it apart from the current crop of content generation bots.
It's so cool someone made this, but

> The procedural web will be the future of the web. It will offer us infinite content

Yup it'll be infinite "garbage"

It's already garbage.

This will be better personalized garbage.

As the saying goes, "one person's trash is another person's gold."

Increased personalization of content at effectively infinite scales is a future I think most people really aren't wrapping their heads around.

There will be both good and bad applications of it, but when it finally crosses the threshold, it's going to hit like a tsunami.

The Internet built the infrastructure for 1:1 content, but there's simply never been the capacity for building out the creative.

When you are watching a TV show that's being written and rendered live specifically for you, incorporating your social media data to develop relevant topics and story arcs, and pulse rate and eye tracking to gauge and adapt for interest/emotional connection -- traditional content simply isn't going to hold a candle.

To me, that show is almost certainly going to be trash. But to you, it will be gold.

The components for that future are arriving faster than I ever thought they would, and while it is still a ways off, it's increasingly an inevitable result.

Your comment reminds me of people pointing out the uncanny valley years ago in claiming that computers would never create realistic looking humans. Just last week there was research showing that not only can't people tell AI-generated people apart from real ones, but they find the AI-generated ones more trustworthy.

There's a huge difference between the beginning of implementation of a technology advance and its half-life.

"Quantity has a quality of its own"
The current web isn't too far off, with SEO articles and ad-video autoplay.
Exactly, it is inevitable, the traditional web will be diluted with the same garbage.
Not necessarily - federated media, webs of trust, and diligent curation across many smaller communities could allow for something that replaces Twitter, reddit, and centralized media hubs. Search within that context is easier - p2p/torrent streaming with crypto incentivized seeding can scale distribution.

The current state of adtech and near total surveillance isn't sustainable as more people wake up to the downsides, and as fake crap begins to accumulate.

Decentralization of social media, advertising, e-commerce and other web 2.0 staples will be a natural evolution of technology. The story goes "under Google's model of the walled garden web, SEO, spam, and bots achieved parity in all content metrics except actual meaningfulness to the user." Despite having all the compute and talent you could possibly bring together, Google is failing to uphold its core technology. They incentivized bad faith behavior, and are reaping the consequences of that. The acceleration of seo hacking and artificial worthless content is asymmetrical to the acceleration of the capabilities and market model Google has created.

A search engine can navigate self selected communities, human curated lists, and creatively bundle lists of lists to achieve high quality results based on actual humans self selecting and acting in their own interests. You can do things with higher quality classification and even provide regex over crawled data without huge technical barriers. Search agents will come about, whether locally or cloud hosted, and will eventually replace centralized engines like Google.

There are non doomed visions of the future. Maybe we won't suffer a digital trashocalypse.

> The procedural web will be the future of the web.

Isn't the "procedural web" built of mountains of (hopefully) human written content? How will the system get content about new subjects without the humans writing it? Isn't a system like GPT-3 currently limited to reflecting the ground truth data it has seen?

For how long? Think of the marketing and censorship opportunities when you can directly tune not just the content that gets seen but also the content itself! Content is still at least somewhat robust to censorship as it's sometimes difficult to remove all references to a banned book. Imagine if banning content also automatically rewrote all references such that they no longer made reference to the content? Or if one could simply pay and have all reviews of a mediocre book changed to make it the greatest book ever?

Note the above is a statement on some of the risks to a procedural web. Not a real market opportunity.

You'd have to use a trusted language model to get you banned information.
It's a good point. It's a total control of the word, in addition to being limited to only consuming, we cannot comment on our opinion or refute the lie.
> Isn't a system like GPT-3 currently limited to reflecting the ground truth data it has seen?

This limitation went away recently. A variant called RETRO (Retrieval-Enhanced Transformer) can use a search engine to take in the exact information up to date [1], assuming you can curate your own text corpus. It's also 25x smaller.

[1] https://deepmind.com/research/publications/2021/improving-la...
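To make the "curate your own text corpus" part concrete, here is a rough sketch of the simplest retrieve-then-prompt pattern: pick the passages that share the most words with the question and paste them in front of it. To be clear, this is a swapped-in, simpler technique; RETRO itself wires retrieval into the model rather than into the prompt. The function names and the prompt layout are assumptions of mine.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(query, corpus, k=2):
    """Return up to k passages ranked by word overlap with the query."""
    query_counts = Counter(tokenize(query))
    scored = []
    for index, passage in enumerate(corpus):
        passage_counts = Counter(tokenize(passage))
        overlap = sum((query_counts & passage_counts).values())
        if overlap:
            # Negative index keeps earlier passages first when scores tie.
            scored.append((overlap, -index, passage))
    scored.sort(reverse=True)
    return [passage for _, _, passage in scored[:k]]


def build_prompt(query, corpus, k=2):
    """Prepend the retrieved passages to the question as plain-text context."""
    passages = retrieve(query, corpus, k)
    context = "\n".join(f"- {p}" for p in passages) or "- (nothing retrieved)"
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
```

The quality of the answer is then bounded by the quality of the curated corpus, which is the point of trusting the sources rather than the whole web.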

That's really cool. But unless I am misunderstanding this, that still puts the burden on the existing web though right, it's just avoiding having to retrain the model? If there is no economical market for humans to produce new content about a topic how will the search engine find the "ground truth" content?
You might want to use a limited subset of the web, a curated list of sources or feeds. Apparently 1TB of text could be enough, just need to collect it or download it from a trusted source.
So, suppose there is a new kind of cocktail that is popular in bars near me that nobody has written about under its new trendy name.

How do I ask this system about the recipe, or the history of the cocktail? Someone has to write an article about it, right? How do they get paid if it gets scraped once and people go to the scraping model for the answer instead of visiting the original article's page?

Give it two years and we might have passable agents running on phones. There'll be a sufficiently powerful and small model that you can use with 8 GB of RAM or less on desktop within a year.

These first large language models are naive, unoptimized implementations of data structures we're learning to inspect and optimize. Something like retro that runs locally with a "just clever enough" service agent is so close to workable. I can't wait to see what happens in ML over the next two years, and who knows what kind of radical evolution the next big algorithm is going to bring.

Oh I totally see that, the issue I'm talking about isn't one of compute, but of high quality ground truth. This machine can hallucinate all kinds of information in perfect English already. The difficulty is that a good search engine needs to return more than just information that matches my query, it should return information that matches the objective reality people (and currently not the machine) inhabit. The machine needs text input to learn about the world; is the future going to look like companies hiring people to write essays about the world for machine consumption?

I think it's a similar problem we see today with ad-supported news being indexed by search engines, but taken to another magnitude when those articles need to be scanned by a model only once to have near perfect recall of the details.

I haven't thought of that in years. I gave up my crusade to revive use of the interrobang(‽) in writing a while ago.

After reviewing the materials I see that Cuil Theory has come a bit further since I last read. I believe that Goopt would be somewhere around -2‽ from Cuil theory itself, negative because it's literal reality, but distant because it's an abstract embodiment.

Slightly off-topic. During my cursory reading I see that imaginary Cuil got fleshed out. I'd like a second opinion. The way it reads to me is that 'i‽' is almost the literal definition of solipsism.

This is an incredible idea. There are so many unexplored possibilities when you re-write and re-format the web.

Can you mix procedural and static content? How can you verify accuracy of information? What if you could refine a web page’s content just-in-time? Modifying the query and context etc.

Through a lens of Roam/Notion: what if everything were a block that could be individually linked? what if every block could be edited by anybody? what if anyone could add links and annotations across pages? a blend of web and wiki?

* Can you mix procedural and static content? Just the idea of the wiki is interesting here. Perhaps there could be a wiki that stores content in a static way, that is edited by users putting the best content they find on the procedural web. It would be a valuable place to find good ideas or ideas that we might not have thought of but someone else did. This could also serve as feedback for AI models. Although it is also true that we would not be able to distinguish if non-human opinions start to creep in and end up contaminating the site.

* How can you verify accuracy of information? I think this is one of the main difficulties, as the AI would have to understand contexts and have a notion of truth, I think this would already start to touch the capacity of "consciousness".

* What if you could refine a web page’s content just-in-time? You will be able to do this for every part that you don't like enough and want something better, or just to see something different.

Haha, this is an amazing concept. It feels like a satire or an art piece. I love it, but it kind of gives me the "is the world real?" feeling.
Simulations on simulations. How to distinguish the real?
Procedural generation is going to be fantastic for personalizing language and explaining ideas; it's pretty obviously the future for anything written that is learning & information oriented.

However I'm concerned these personalization systems will be too accommodating and just tell people what they want to hear. One way people use search is for motivated reasoning: I believe something, I search for it, I find confirming evidence. A procedurally generated system I imagine is especially prone to this kind of massaging of queries to get desired outputs - tweak a few input parameters in the form of a query, and out pops the answer you seek. It's a hard problem to test for.

Funny thing is when people search to find what they want to hear, they're often driven to content farms, and those are increasingly ML-generated. Seems this is just cutting out the middlemen!

In the event you think you're looking at a simulation of the Internet, maybe start out by checking if news, maps, and weather are realistic, to see how good their world simulation is. Live news video should be interesting too.

But if your browser is compromised so that encryption doesn't work, I think you have bigger problems.

The concept of simulation is very interesting, Baudrillard's Simulacra and Simulation and other related theories will gain more strength and meaning.
Is there any way to get access to a GPT-3-like API that can be run locally? (Color me ignorant: I know generating the net for GPT-3 is huge, but I have no idea how small the usable result can be stored so that usage can happen locally instead of on some cloud server.)
Public models like GPT-NeoX-20B need a minimum of 45GB of VRAM. That's two 3090s (maybe four or five grand, depending on how much effort you spend on bid sniping eBay auctions) or a single A100 80GB ($20,000+).

Also note that NeoX-20B is pretty good, but it's not GPT-3 quality.
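The back-of-the-envelope arithmetic behind numbers like that, as a sketch: weights alone take (parameter count) x (bytes per parameter), so 20 billion parameters at 16-bit precision is 40 GB before any runtime overhead. The 10% overhead factor below is purely my assumption to land near the quoted 45GB; real overhead depends on batch size, sequence length, and the inference stack.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Memory for the weights alone, in decimal gigabytes (fp16 by default)."""
    return num_params * bytes_per_param / 1e9


def fits(num_params, gpus, vram_per_gpu_gb, bytes_per_param=2, overhead=1.1):
    """Rough check: do the weights plus an assumed overhead fit in total VRAM?"""
    needed = weight_memory_gb(num_params, bytes_per_param) * overhead
    return needed <= gpus * vram_per_gpu_gb


print(weight_memory_gb(20e9))   # weights only, fp16
print(fits(20e9, 2, 24))        # two 24 GB cards
print(fits(20e9, 1, 24))        # one 24 GB card
print(fits(20e9, 1, 80))        # one 80 GB card
```

Halving `bytes_per_param` (8-bit weights) halves the estimate, which is why quantization matters so much for running these models at home.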

So basically modern Google without ads.
"When there is no monetization you are the product" (patent pending)
Exactly, content no longer revolves around monetization.
very cool / congrats! was recently tweeting about same -- the potential to help humans search the existing web better beyond just keyword search, i.e., query rewriting, summarizing and extending existing content, etc.

there's also the flip side around SEO spam, which is partly why I founded Breeze, a newish topic search engine that leverages curation to hedge against the dark side of human / bot spam, etc.

bottom line, love this, having worked with GPT-3 in past and the direct impact on day job, all things search

Thanks man! It's a good idea, maybe a similar filter or curator will be needed for the procedural web, but for the dark part of the AI; disinformation, meaningless content, etc.
Does the full version of this require Strong AI to truly replace the internet? What level of AI is necessary to convincingly replicate human understanding and explanation of information?
It is something that is still not clear to me, I see the difficulty of the task but also the rapid evolution of AI models. Maybe it will surprise us in not too long.
Goopt is an experiment in what the "procedural web" could be. This new web will use procedural content generation to create varied, completely synthetic content, since it is generated by algorithms and artificial intelligence. For now, the content is only text that is generated automatically with GPT-3, the recent OpenAI model. Goopt works as if it were a search engine, allowing us to search for any term, obtain related results and access their content, simulating in this way the experience of browsing the web.

I started working on this because I think the idea of the procedural web is interesting. I think this technology will arrive in a future that is maybe not too far away, so we have to start thinking about the possibilities, problems, dilemmas and paradoxes. I think this is a big change, one with a direct impact on the information we consume and how we consume it and on the online experience; it could even shape our behavior, since the web is our system for information, reference, entertainment, etc. We must be ready for a change of this size, so I invite you to give life to this topic and put it up for discussion.

What is Procedural Web?

This is a term that has not yet been treated as such, so I have tried to give my vision on the matter. The procedural web will be the future of the web. It will offer us infinite content, since it will not be necessary that someone has written or created it before. All the content will be synthetic and generated at the moment, with infinite possibilities. From informative text, articles, images, videos, to games, applications and services with interfaces and functionality that are automatically generated. All this adapted to our queries and needs, and increasingly personalized to our preferences.

Web 4.0 could be the propitious evolution for the procedural web. The automation that this new version of the web poses about applications, services, interfaces, APIs, devices and others, could be exploited by connecting them with the procedural web. These reality data interfaces would help empower and enhance their generative capabilities, as well as connect their functionality to the real world, having services and devices on which to execute actions. This makes the procedural web even more interesting, because it endows it with cybernetic capabilities.

The procedural web is based on natural language processing (NLP) and procedural content generation (PCG). Advances in these fields, as well as in the field of computational creativity, will allow us to generate increasingly better synthetic multimedia, thus nurturing the procedural web with more and better content. It will be interesting to see if this content comes to satisfy us more than the traditional web and human creation. Maybe one day we won't ever be able to distinguish.

Demo video and usage guide in the GitHub repository:

https://github.com/jokenox/Goopt

Is there any use of ML to distinguish the AI generated dead internet from the real one?
For generated headlines humans are a coin toss, can't tell them apart. But transformers can reach 85% accuracy.

https://aclanthology.org/2021.nlp4if-1.1.pdf

Unfortunately, does not come with a Gwyneth-Paltrow-scented candle.