Hacker News
by mattnewton 1579 days ago
> The procedural web will be the future of the web.

Isn't the "procedural web" built on mountains of (hopefully) human-written content? How will the system get content about new subjects without humans writing it? Isn't a system like GPT-3 currently limited to reflecting the ground truth data it has seen?

For how long? Think of the marketing and censorship opportunities when you can directly tune not just which content gets seen but also the content itself! Today, content is still at least somewhat robust to censorship, since it's sometimes difficult to remove every reference to a banned book. Imagine if banning content also automatically rewrote every reference to it so that no mention remained. Or if one could simply pay to have all reviews of a mediocre book changed to call it the greatest book ever?

Note that the above describes some of the risks of a procedural web, not a real market opportunity.

You'd have to use a trusted language model to get banned information.
It's a good point. It amounts to total control of the word: on top of being limited to consuming, we could not voice our own opinion or refute the lie.
> Isn't a system like GPT-3 currently limited to reflecting the ground truth data it has seen?

This limitation went away recently. A variant called RETRO (Retrieval-Enhanced Transformer) retrieves passages from a text database while it generates, so it can draw on exact, up-to-date information [1], assuming you can curate your own text corpus. It is also reported to match GPT-3 with about 25x fewer parameters.

[1] https://deepmind.com/research/publications/2021/improving-la...
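
To make the retrieval idea concrete, here is a minimal sketch in Python. It is not RETRO itself: RETRO feeds retrieved text into the model, whereas this toy ranks passages by bag-of-words cosine similarity and pastes the best match into a prompt. The names `CORPUS`, `retrieve`, and `build_prompt`, and the two sample passages, are made up for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical curated corpus; in practice this would be a large text database.
CORPUS = [
    "A negroni is made with gin, vermouth, and Campari.",
    "RETRO retrieves passages from a text database while generating.",
]

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def retrieve(query, corpus, k=1):
    """Return up to k passages that share at least one token with the query."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(doc))), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]

def build_prompt(query, corpus, k=1):
    """Prepend retrieved passages to the question, as a language model would see it."""
    passages = retrieve(query, corpus, k)
    context = "\n".join(passages) if passages else "(no matching passage in corpus)"
    return f"Context:\n{context}\n\nQuestion: {query}"

if __name__ == "__main__":
    print(build_prompt("how is a negroni made", CORPUS))
```

If nothing in the corpus mentions the topic, `retrieve` returns an empty list, so the answer can only be as good as what someone has already written and curated.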

That's really cool. But unless I am misunderstanding this, that still puts the burden on the existing web, right? It just avoids having to retrain the model. If there is no economic market for humans to produce new content about a topic, how will the search engine find the "ground truth" content?
You might want to use a limited subset of the web: a curated list of sources or feeds. Apparently 1 TB of text could be enough; you just need to collect it or download it from a trusted source.
So, suppose there is a new kind of cocktail that is popular in bars near me that nobody has written about under its new trendy name.

How do I ask this system about the recipe, or the history of the cocktail? Someone has to write an article about it, right? How do they get paid if it gets scraped once and people go to the scraping model for the answer instead of visiting the original article's page?

Give it two years and we might have passable agents running on phones. Within a year there'll be a sufficiently powerful and small model that you can run on a desktop with 8 GB of RAM or less.

These first large language models are naive, unoptimized implementations of data structures we're learning to inspect and optimize. Something like RETRO that runs locally with a "just clever enough" service agent is so close to workable. I can't wait to see what happens in ML over the next two years, and who knows what kind of radical evolution the next big algorithm is going to bring.

Oh, I totally see that. The issue I'm talking about isn't one of compute but of high-quality ground truth. This machine can already hallucinate all kinds of information in perfect English. The difficulty is that a good search engine needs to return more than just information that matches my query; it should return information that matches the objective reality people (and currently not the machine) inhabit. The machine needs text input to learn about the world; is the future going to look like companies hiring people to write essays about the world for machine consumption?

I think it's similar to the problem we see today with ad-supported news being indexed by search engines, but taken to another order of magnitude when a model needs to scan those articles only once to have near-perfect recall of the details.