by ethbro 2349 days ago
> However, it will be interesting to figure the heuristics to deliver better quality search results today.

If only there were some kind of analog for effective ways to locate information. Like if everything were written on paper, bound into collections, and then tossed into a large holding room.

I guess it's past the Internet's event horizon now, but crawler-primary searching wasn't the only evolutionary path to search.

Prior to Google (technically: AdWords revenue funding Google) seizing the market, human-curated directories were dominant [1, Virtual Library, 1991] [2, Yahoo Directory, 1994] [3, DMOZ, 1998].

Their weakness was always cost of maintenance (link rot), scaling with exponential web growth, and initial indexing.

Their strength was deep domain expertise.

Google's initial success was fusing crawling (discovery) with PageRank (ranking), where the latter served as an automated "close enough" approximation of human directory building.
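For anyone who hasn't seen the ranking half spelled out, here is a minimal sketch of the PageRank idea as a plain power iteration. It is a textbook simplification, not Google's actual implementation; the `pagerank` function name, the damping factor of 0.85, the iteration count, and the toy link graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links: dict mapping each page to the list of pages it links to.
    Returns a dict mapping each page to its rank (ranks sum to ~1).
    """
    # Collect every page, including ones that only appear as link targets.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start from a uniform distribution.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline share (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank evenly among its outlinks.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" is linked to by everyone else.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

The point is that each link acts as a vote, which is why the approach leaned so heavily on people hand-linking to sites they found useful.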

Unfortunately, in the decades since, we seem to have forgotten how useful hand-curated directories were, in our haste to build more sophisticated algorithms.

Add to that the fact that the very structure of the web has changed. When PageRank first debuted, people were still manually adding links to their friends' and other useful sites on their own pages. Does that sound like the link structure we have on the web now?

Small surprise results are getting worse and worse.

IMHO, we'd get a lot of traction out of creating a symbiotic ecosystem whereby crawlers cooperate with human curators, both of whose enriched output is then fed through machine learning algorithms. In other words, a move back to supervised learning for web search, versus the currently dominant unsupervised approach.
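As a very rough sketch of what "supervised" could mean here: treat curator judgments as labels and crawler-derived signals as features, then fit a small classifier. Everything below is hypothetical and only for illustration: the two features (a crawler link score and a "listed in a curated directory" flag), the toy data, the learning rate, and the `train` / `score` helpers. It is a plain logistic regression in pure Python, not a description of how any real search engine works.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(weights, bias, features):
    """Estimated probability that a page is useful."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def train(examples, labels, lr=0.5, epochs=2000):
    """Fit logistic regression with simple stochastic gradient descent."""
    weights = [0.0] * len(examples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(examples, labels):
            error = score(weights, bias, features) - label
            weights = [w - lr * error * x for w, x in zip(weights, features)]
            bias -= lr * error
    return weights, bias


# Hypothetical training data.
# Features: [crawler link score, listed in a curated directory (1.0 or 0.0)]
# Labels:   1 = a curator judged the page useful, 0 = not useful
pages = [
    [0.9, 0.0],  # heavily linked, but never listed by curators
    [0.6, 1.0],
    [0.2, 1.0],  # few links, but listed by curators
    [0.3, 0.0],
    [0.8, 1.0],
    [0.7, 0.0],
]
judgments = [0, 1, 1, 0, 1, 0]

weights, bias = train(pages, judgments)
print(score(weights, bias, [0.2, 1.0]))  # few links, curator-listed
print(score(weights, bias, [0.9, 0.0]))  # many links, not listed
```

In this toy setup the model learns to trust the curator signal over raw link counts, which is the kind of correction hand-built directories used to provide.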

[1] https://en.m.wikipedia.org/wiki/World_Wide_Web_Virtual_Libra... , http://vlib.org/

[2] https://en.m.wikipedia.org/wiki/Yahoo!_Directory

[3] https://en.m.wikipedia.org/wiki/DMOZ , https://www.dmoz-odp.org/

4 comments

Mixing human curation with crawlers is probably something that'd help with search results quality, but the issue comes in trying to get it to scale properly. Directories like the Open Directory Project/DMOZ and Yahoo's directory had a reputation for being slow to update, which left them miles behind Google and its ilk when it came to indexing new sites and information.

This was problematic when entire categories of sites were basically left out of the running, since the directory had no way to categorise them. I had that problem with a site about a video game system the directory hadn't added yet, and I suspect others would have it for, say, a site about a newer TV show/film or a new JavaScript framework.

You've also got the increase in resources needed (you need tons of staff for effective curation), and the issues with potential corruption to deal with (another thing which significantly affected the ODP's usefulness in its later years).

Federation would help with both breadth and potential corruption, compared to what we had with ODP/DMOZ. A federated Web directory (with common naming/categorization standards, but very little beyond that) would probably have been infeasible back then simply because the Internet was so much smaller and fewer people were involved (and DMOZ itself partially made up for that lack by linking to "awesome"-like link lists where applicable) - but I'm quite sure that it could work today, particularly in the "commercial-ish" domain where corruption worries are most relevant.
The results are human curated, as much as Google would like to publicly pretend otherwise.

I think a more fundamental problem is a large portion of content production is now either unindexable or difficult to index - Facebook, Instagram, Discord, and YouTube to name a few. Pre-Facebook the bulk of new content was indexable.

YouTube is relatively open, but the content and context of what is being produced are difficult to extract, if only because people talk differently than they write. That doesn’t mean, in my opinion, that the quality of a YouTube video is lower than what would have been written in a blog post 15 years ago, but it makes it much more difficult to extract snippets of knowledge.

Ad monetization has created a lot of noise too, but I’m not sure there would be less noise without it. Rather, it’s a profit-motive issue. For many, many searches I just go straight to Wikipedia and wouldn’t for a moment consider using Google.

Frankly I think the discussion here is way better than the pretty mediocre to terrible “case study” that was posted.

Immediately before Google there were search engines like AltaVista https://en.wikipedia.org/wiki/AltaVista (1995) and Lycos https://en.wikipedia.org/wiki/Lycos (1994), which were not directories like Yahoo. Google won by not being cluttered with non-search web portal features, by the effectiveness of PageRank, and because by the late 1990s the web was too big to be indexed by a manually curated directory.
"Halt And Catch Fire" had a cool way of working these two approaches to search into its plot line.