by Barrin92
1947 days ago
You don't need any NLP to rank webpages (in fact, the entire innovation of Google was figuring out a way to rank pages without it). PageRank works fundamentally by treating the web as a graph and prioritising results based on their connections; that is to say, it ranks based on popularity and is agnostic about the content of the actual page. This has generally worked well. On the other hand, attempting to manipulate search results based on automated handling of content is what has given us countless censorship debates, or simple failures where even uncontroversial content is removed or downranked because it contained a 'bad word' and so violated some strange rule. On Facebook recently, clothing ads for disabled people were banned[1] because, as it turns out, the ML system only cared about the wheelchair, not the person in it. It's actually fairly straightforward to build recommender systems on transparent, graph-based algorithms, and it gives you the added advantage of not discriminating in strange ways.
[1] https://www.nytimes.com/2021/02/11/style/disabled-fashion-fa...
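To make the "web as a graph" idea concrete, here is a minimal sketch of PageRank by power iteration. It is a textbook-style illustration, not Google's production ranking; the `pagerank` function, the `toy_web` graph, and the damping value of 0.85 are all assumptions chosen for the example. Note that the code never looks at page content, only at who links to whom.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping page -> list of outgoing links."""
    # Collect every page that appears as a source or as a link target.
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution

    for _ in range(iterations):
        # Every page receives a small baseline share (the "random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page (no outgoing links): spread its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" has the most incoming links, "d" has none.
toy_web = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

In this toy graph the most linked-to page ends up on top and the page nobody links to ends up at the bottom, regardless of what any of the pages say.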
It's trivial to generate webs of fake, interrelated content and use them specifically to feed incoming links to valuable pages. Or to comment-spam websites so aggressively that it ruins them. Or consider all of the secret deals between high-ranking sites to feed each other links even though the sites weren't related. There are countless examples of black-hat techniques for breaking PageRank.
I am sorry, but you simply can't build a sustainable search engine without deeply understanding user intent and the meaning behind the indexed pages.
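The link-farm attack described above can be sketched on a toy graph. This is an illustration under stated assumptions, not a model of any real search engine: `rank_pages` is a compact power-iteration PageRank written for this example, and the page names (`news`, `blog`, `rival`, `target`, `fake0` and so on) are hypothetical. The spammer controls `target` and 20 fake pages that link to it, with `target` linking back to the farm.

```python
def rank_pages(links, damping=0.85, iterations=100):
    """Compact power-iteration PageRank over a dict of page -> outgoing links."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = dict.fromkeys(pages, 1.0 / n)
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is shared among all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = dict.fromkeys(pages, (1.0 - damping + damping * dangling) / n)
        for page, targets in links.items():
            for t in targets:
                new_rank[t] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank


# Honest web: "rival" earns a link from "news"; nobody links to "target".
honest_web = {
    "news": ["blog", "rival"],
    "blog": ["news"],
    "rival": ["news"],
    "target": ["news"],
}

# Link farm (hypothetical): 20 fake pages link to "target", which links back.
fake_pages = [f"fake{i}" for i in range(20)]
spammed_web = {
    **honest_web,
    "target": fake_pages,
    **{fake: ["target"] for fake in fake_pages},
}

before = rank_pages(honest_web)
after = rank_pages(spammed_web)
print("before: target outranks rival?", before["target"] > before["rival"])
print("after:  target outranks rival?", after["target"] > after["rival"])
```

Nothing about the content of `target` changed between the two runs; only the link structure did, which is the weakness a purely graph-based ranking has to defend against.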