by krackers 535 days ago
> that forum post no longer exists and no longer shows up in search results

I dream of someone taking the Internet Archive data, capping it at 2010 or so, then making a search engine out of it. I mean, if AI companies are looking to gobble up all the data they can get, then surely they'd jump at the chance to train on (higher quality) data from the past that simply no longer exists on the web. So it seems like a win-win if IA gave them a copy of the data on the condition that they maintain a permanent backup and provide some sort of searchable index of it (maybe even via LLM). In turn, the AI companies would get access to high-quality data on obscure topics.

2 comments

Yup, let's not tie such an important endeavor to AI and AI startups though, we need something robust and lasting :-)
> I dream of someone taking the internet archive data, capping it at 2010 or so, then making a search engine out of it

It sounds like you're describing CommonCrawl.org, and yes, it's already popular with AI companies.