| HN Mirror


	by zozbot234 2349 days ago
	The Common Crawl is a thing already. Unfortunately, a "full" text crawl of the internets is a YUUUGE amount of data to manage, and I can't think of anything that could change that in the foreseeable future. That's why I think providing a federated Web directory standard, à la ODP/DMOZ but not limited to a single source, would be a far more impactful development.

3 comments

reaperducer 2349 days ago

> Unfortunately, a "full" text crawl of the internets is a YUUUGE amount of data to manage

Maybe instead of a problem, there is an opportunity here.

Back before Google ate the intarwebs, there used to be niche search engines. Perhaps that is an idea whose time has come again.

For example, if I want information from a government source, I use a search engine that specializes in crawling only government web sites.

If I want information about Berlin, I use a search engine that only crawls web sites with information about Berlin, or that are located in Berlin.

If I want information about health, I use a search engine that only crawls medical web sites.

Each topic is still a wealth of information, but siloed enough that the amount of data could be manageable for a small or medium-sized company. And the market would keep the niches from getting so small that they become useless. A search engine dedicated to Hello Kitty lanyards isn't going to monetize.

LargoLasskhyfv 2349 days ago

I'd be happy with something like Searx [1,2,3]

[1] https://en.wikipedia.org/wiki/Searx [2] https://asciimoo.github.io/searx/ [3] https://stats.searx.xyz/

featuring the semantic map of [4] https://swisscows.ch/

incorporating [5] https://curlie.org/ and Wikipedia and something like Yelp/YellowPages embedded in OpenStreetMap for businesses and points of interest, with a no-frills interface showing the history (via timeslide?) of edits.

Bang! Done!

zozbot234 2349 days ago

That's the problem that web directories solve. It's not that you're wrong, it's just largely orthogonal to the problem that you'd need a large crawl of the internets for, i.e. spotting sites about X niche that you wouldn't find even from other directly-related sites, and that are too obscure, new, etc. to be linked in any web directory.

reaperducer 2349 days ago

> That's the problem that web directories solve

Not really. A web directory is a directory of web sites. I can't search a web directory for content within the web sites, which is what a niche search engine would do.

Beldin 2349 days ago

On the other hand, the niche search engine depends upon having such a web directory (the list of sites to index).
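
To make that dependency concrete, here is a minimal sketch of how a niche crawler might use a directory as its scope filter. The function name `in_scope` and the example site list are hypothetical, not taken from any existing crawler; the rule assumed here is simply "a URL is in scope if its host is a listed site or a subdomain of one".

```python
from urllib.parse import urlsplit


def in_scope(url: str, directory: set[str]) -> bool:
    """Return True if the URL's host is a listed site or a subdomain of one.

    `directory` is the (hypothetical) list of sites the niche engine indexes.
    """
    host = (urlsplit(url).hostname or "").lower()
    return any(host == site or host.endswith("." + site) for site in directory)


# Hypothetical directory for a Berlin-only search engine.
berlin_sites = {"berlin.de", "visitberlin.de"}
frontier = [
    "https://www.berlin.de/en/",
    "https://notberlin.de/",
    "https://visitberlin.de/en",
]
to_crawl = [u for u in frontier if in_scope(u, berlin_sites)]
```

Matching on the dot-prefixed suffix rather than a bare `endswith(site)` keeps look-alike hosts such as `notberlin.de` out of the crawl.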

teddyh 2349 days ago

Like WAIS?

https://en.wikipedia.org/wiki/Wide_area_information_server

HomeDeLaPot 2349 days ago

Don't forget the search engine search engine!

chongli 2349 days ago

You don’t really need to store a full text crawl if you’re going to be penalizing or blacklisting all of the ad-filled SEO junk sites. If your algorithm scores the site below a certain threshold then flag it as junk and store only a hash of the page.
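
A rough sketch of that storage rule, assuming a quality score in the range 0 to 1 has already been computed by some ranking step that is not shown. The threshold value and the record layout are made up for illustration.

```python
import hashlib

JUNK_THRESHOLD = 0.3  # hypothetical cutoff; a real system would tune this


def index_entry(url: str, text: str, score: float) -> dict:
    """Build the record kept for one crawled page.

    Pages scoring below the threshold are flagged as junk and keep only a
    SHA-256 hash of their text; other pages keep the full text as well.
    """
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if score < JUNK_THRESHOLD:
        return {"url": url, "junk": True, "sha256": digest}
    return {"url": url, "junk": False, "sha256": digest, "text": text}
```

Keeping the hash lets a later crawl recognise an unchanged junk page without storing or re-scoring its content.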

Another potentially useful approach is to construct a graph database of all these sites, with links as edges. If one page gets flagged as junk then you can lower the scores of all other pages within its clique [1]. This could potentially cause a cascade of junk-flagging, cleaning large swathes of these undesirable sites from the index.

[1] https://en.wikipedia.org/wiki/Clique_(graph_theory)
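
One way the cascade could look, as a sketch under several assumptions of mine: only reciprocated links count as edges, "within its clique" is read as "shares a clique of at least three pages" (that is, a triangle) with the flagged page, and the threshold and penalty values are invented. Every page is assumed to appear as a key in both `links` and `scores`.

```python
from collections import deque

JUNK_THRESHOLD = 0.3  # hypothetical cutoff
PENALTY = 0.2         # hypothetical score reduction per junk clique-mate


def mutual_links(links: dict[str, set[str]]) -> dict[str, set[str]]:
    """Build an undirected graph that keeps only reciprocated links."""
    graph = {page: set() for page in links}
    for src, targets in links.items():
        for dst in targets:
            if dst != src and src in links.get(dst, ()):
                graph[src].add(dst)
                graph[dst].add(src)
    return graph


def clique_mates(graph: dict[str, set[str]], page: str) -> set[str]:
    """Neighbours that share a third common neighbour with `page`.

    Two adjacent pages with a common neighbour form a triangle, which is
    a clique of size three.
    """
    neighbours = graph[page]
    return {n for n in neighbours if graph[n] & neighbours}


def cascade(links, scores, seed):
    """Flag `seed` as junk, then propagate penalties through its cliques."""
    graph = mutual_links(links)
    scores = dict(scores)  # work on a copy
    junk = {seed}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        for mate in clique_mates(graph, page):
            if mate in junk:
                continue
            scores[mate] -= PENALTY
            if scores[mate] < JUNK_THRESHOLD:
                junk.add(mate)
                queue.append(mate)
    return junk, scores


links = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
    "e": {"a"},
}
scores = {"a": 0.1, "b": 0.4, "c": 0.9, "d": 0.35, "e": 0.5}
junk, new_scores = cascade(links, scores, "a")
```

Here `a`, `b` and `c` link to each other, so flagging `a` drags `b` under the threshold, while `c` is penalised twice but survives; `d` and `e` are not part of any triangle and are left alone.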

anoncake 2349 days ago

JavaScript, which Google coincidentally pushed and still pushes for, doesn't exactly make the web easier to crawl either.
