| > However, it will be interesting to figure out the heuristics to deliver better quality search results today. If only there were some kind of analog for effective ways to locate information. Like if everything were written on paper, bound into collections, and then tossed into a large holding room. I guess it's past the Internet's event horizon now, but crawler-primary searching wasn't the only evolutionary path to search. Prior to Google (technically: AdWords revenue funding Google) seizing the market, human-curated directories were dominant [1, Virtual Library, 1991] [2, Yahoo Directory, 1994] [3, DMOZ, 1998]. Their weaknesses were always cost of maintenance (link rot), scaling with exponential web growth, and initial indexing. Their strength was deep domain expertise. Google's initial success was fusing crawling (discovery) with PageRank (ranking), where the latter served as an automated "close enough" approximation of human directory building. Unfortunately, in the decades since, we seem to have forgotten how useful hand-curated directories were, in our haste to build more sophisticated algorithms. Add to that the fact that the very structure of the web has changed. When PageRank first debuted, people were still manually adding links to their friends' and other useful sites on their own. Does that sound like the link structure we have on the web now? Small surprise that results are getting worse and worse. IMHO, we'd get a lot of traction out of creating a symbiotic ecosystem whereby crawlers cooperate with human curators, both of whose enriched output is then fed through machine learning algorithms. Aka a move back to supervised web search learning, vs the currently dominant unsupervised. [1] https://en.m.wikipedia.org/wiki/World_Wide_Web_Virtual_Libra... , http://vlib.org/ [2] https://en.m.wikipedia.org/wiki/Yahoo!_Directory [3] https://en.m.wikipedia.org/wiki/DMOZ , https://www.dmoz-odp.org/ |
That approach was problematic when entire categories of sites were basically left out of the running because the directory had no way to categorise them. I had that problem with a site about a video game system the directory hadn't added yet, and I suspect others would have it for, say, a site about a newer TV show/film or a new JavaScript framework.
You've also got the increase in resources needed (you need tons of staff for effective curation), and the potential for corruption to deal with (another thing that significantly affected the ODP's usefulness in its later years).