by jstx1 1292 days ago
I don’t think so. Google is still a search engine first and a question answering machine second. And for the question answering I will still prefer links over a blob of text that can’t be inspected or verified.

Plus, who is going to produce the corpus you feed the magic chat engine?
As everyone starts to adopt AI, are we going to get to a point where the AI is eating itself? I could imagine AI failing the way inbred genetic lines accumulate harmful mutations.
Yep, as AI starts to get trained on AI-generated data, the output may well become unstable. You can't build a perpetual motion machine (or an infinite-gain machine or infinite-SNR amplifier), and the system may degrade to essentially white noise.

Sort of a cyber-Kessler syndrome, basically. You really don't want AI-generated content in your AI training material. It's probably not generating signal for building future models unless it's undergone further refinement that adds value. An artist iterating on AI artwork is adding signal, and a bunch of artist-curated but not iterated AI artworks probably adds a small amount of signal. But unrefined blogspam and trivial "this one looks cool" picks are probably reducing signal when you consider the overall output. The AI training process is stable and tolerant of a certain amount of AI content, but if you fed in a large portion of unrefined second-order or third-order AI content you would probably get a worse overall result.
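To make the "tolerant up to a point" intuition concrete, here is a toy sketch, not a claim about how real image or language models behave. The "model" is just a Gaussian fitted to its training data, and each generation trains on a mix of fresh organic samples and samples drawn from the previous generation's model. The function names, the Gaussian stand-in, and every parameter value are assumptions made up for illustration.

```python
import random
import statistics


def fit(samples):
    """'Train' the toy model: estimate the mean and std dev of the data."""
    return statistics.fmean(samples), statistics.pstdev(samples)


def run_generations(organic_fraction, generations=500, n=50, seed=0):
    """Return the model's std dev after repeated retraining.

    organic_fraction is the share of each training set drawn from the
    true ("organic") distribution N(0, 1); the rest is sampled from the
    previous generation's fitted model.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0 matches the organic distribution
    n_organic = round(n * organic_fraction)
    for _ in range(generations):
        data = [rng.gauss(0.0, 1.0) for _ in range(n_organic)]
        data += [rng.gauss(mu, sigma) for _ in range(n - n_organic)]
        mu, sigma = fit(data)
    return sigma


if __name__ == "__main__":
    print("all synthetic:", run_generations(0.0))
    print("half organic: ", run_generations(0.5))
```

In this toy setup the all-synthetic run loses almost all of its spread over a few hundred generations, because every refit loses a little variance and nothing puts it back, while the half-organic run stays close to the true spread. It shows only the shape of the argument, not the point at which a real model breaks down.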

Watermarking Stable Diffusion output by default is an extremely smart move in hindsight. Although it's trivial to remove, at least people will have to go to the effort of doing so, and only a small minority of users will bother. But it's a bigger problem than that: you can't really watermark text (again, unless it's called out with a "beep boop I am a robot" tag on Reddit or similar), and you can already see AI-generated text getting picked up in various places, search engines, etc. This is the "debris is flying around and starting to shatter things" stage of the Kessler syndrome.

In the tech world, you already see it with things like those fake review sites that "interpolate" results without explicitly calling it out as such. People build them because they're cheap and easy to do at scale and give you an approximation that is reasonable-ish most of the time for hardware configurations that may not have been explicitly benchmarked. Now imagine that's all content. Wanna search for how to pull a new electrical circuit or fix your washing machine? Could probably be AI generated in the future. Is it right? Maybe...

Untapped sources of true, organic content are going to become unfathomably valuable in the future, and Archive.org is the trillion-dollar gem. Unfortunately, much like Tumblr, if anybody actually buys it the lawyers are going to have a fit and make them delete everything and destroy the asset. But Archive.org probably has the biggest repository of pre-AI organic content on the planet, and that is your repo of training material. Probably the only things remotely comparable are the Library of Congress or Google's scanning project, but those are narrower and focused on specific types of content. You can generally assume almost all content from before GPT and Stable Diffusion is organic, but generated content is already a significant minority, if not the majority, of what's being published now. Like the Kessler syndrome, this is proceeding quickly: it is hitting mass adoption within a span of literally a few years, and now the stage is primed for the cascade event.

The other implication here is that people probably need to operate in the mindset that there will be an asymptotically bounded amount of provably organic training content available. It's not so much that in 10 years we will have 100x the content, because a lot of that content can't really be trusted as input material for further training. A lot of it will be second-order or third-order content generated by bots or AI, and that proportion will increase strongly over the next decade. That's not an inherent dealbreaker, but it probably does have implications for what kinds of training regimes you can build next-next-gen models around. The training set is going to be a lot smaller than people imagine, I think.

Thirteen years ago I met a traveller who paid their way with travel writing, which was basically blog spam. They soon ran out of authentic material so they started writing about places they'd never been using some light googling for inspiration. For a long time now people have been making advertising money by creating bullshit on a large scale. How are you going to prove that any content is organic?
You ultimately can't, and there are certainly degrees of "organicness" even among organic content - a lot of content is essentially infomercials or arguments shilling a particular perspective the author has a financial interest in. And of course there's the case of the Wikipedia editor who completely made up a large share of the Scots Wikipedia articles that have been the training inputs for language translation models etc. That is very organic content, but it's also actually poison to train on!

The good news is the internet is relatively good at routing around the shit, for now. And I guess de facto that is something you could apply to your content inputs: what's the PageRank for this content? Actual PageRank, not the advertising/engagement bullshit that the search model has turned into. If the AI-generated stuff is correct enough that it has a high PageRank, maybe it's correct enough to be used as an input.
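A minimal sketch of what that filter could look like, assuming you already have a link graph for your candidate pages. This is the textbook power-iteration form of PageRank, not whatever Google actually runs, and the page names, the graph, and the "keep anything above the uniform average" threshold are all made up for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical link graph: two spam pages mostly link to each other.
links = {
    "howto-wiring": ["electrical-code"],
    "forum-thread": ["howto-wiring", "electrical-code"],
    "electrical-code": ["howto-wiring"],
    "spam-1": ["spam-2", "howto-wiring"],
    "spam-2": ["spam-1"],
}

ranks = pagerank(links)
threshold = 1.0 / len(ranks)  # assumed cutoff: the uniform average rank
keep = sorted(page for page, score in ranks.items() if score >= threshold)
```

Here `keep` ends up holding only the two pages that the rest of the graph points at. Whether link popularity is a good enough proxy for correctness is exactly the open question.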

But the thing is, there's already been an uptick in ML or AI-generated content surfacing in searches and other places, and it's not always correct. And honestly, the relevance of Google's search results has been noticeably decaying for 10+ years now. Things I know are out there and are relevant are not being surfaced anymore. Is AI generation contributing to that problem? Maybe. Probably not helping, at least.

What seems most likely is that OpenAI and other LLM trainers are going to proceed to training on transcripts of YouTube videos and podcasts using the Whisper speech-to-text model, which at its largest sizes is really quite state-of-the-art. For now, it seems like most of this content is still organic (or if it's not, the computer-generated speech is relatively easy to distinguish).
It seems the most famous Stable Diffusion WebUI doesn't actually implement the watermark. https://github.com/AUTOMATIC1111/stable-diffusion-webui/issu...
Am I alone in not being sure if the commenter here fed the parent into GPT as a prompt to generate output or actually wrote this?
Afraid not, I actually wrote all that shit...
Are we not already there?
> Google is still a search engine first

The web has eroded to a place where a few platforms contain most of the salient information for consumers.

Maybe you're right. But I'm not convinced.

I feel like the mass centralization of content is starting to unwind a bit. As things scale the generalized sources usually become less valuable to me. With more content comes more noise, and that noise is hard to sift through. And while Google isn't perfect, they're better at sifting through this noise than most sites are.

Take StackOverflow as an example. When it first emerged I found it really useful. Answers were generally high quality. There were valuable discussions about the merits of one approach versus another. Now it's a sea of duplicate questions, poor answers and meandering discussions. I rarely visit it anymore, as it's rarely helpful. And I regularly have to correct information others glean from it, as it's often wrong or incomplete.

So I suppose this all goes to say that I'm optimistic that things are headed in the right direction. I imagine things will ebb and flow for some time. But I believe Google and other search engines will always have a role to play, as there will always be new, valuable things to discover.