by Barrin92 1947 days ago
You don't need any NLP to rank webpages (in fact, the entire innovation of Google was that they figured out a way to rank pages while completely ignoring their content). PageRank works fundamentally by treating the web as a graph and prioritising results based on their connections; that is to say, it ranks based on popularity and is agnostic about the content of the actual page.
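
To make the "web as a graph" point concrete, here is a minimal sketch of PageRank by power iteration. It is a textbook-style simplification, not Google's production system; the `toy_web` graph, the function name, and the damping factor of 0.85 are illustrative assumptions. Note that the code only ever looks at links, never at page text.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start from a uniform distribution.
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share (the "random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page (no outlinks): spread its rank over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: three pages link to "c", and "c" links to "a".
toy_web = {
    "a": ["c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(toy_web)
best = max(ranks, key=ranks.get)
print(best, ranks)
```

Page "c" comes out on top purely because of its incoming links, which is the content-agnostic behaviour described above.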

This generally has worked well. On the other hand, actually attempting to manipulate search results based on automated handling of content is what has given us countless censorship debates, or simple failures where even uncontroversial content is removed or downranked because it violated some strange rule by having a 'bad word' in it. On Facebook recently, clothing ads for disabled people were banned[1] because, it turns out, the ML system only cared about the wheelchair, not the person in it.

It's actually fairly straightforward to build recommender systems on transparent, graph-based algorithms, and it gives you the added advantage of not discriminating in strange ways.

[1]https://www.nytimes.com/2021/02/11/style/disabled-fashion-fa...

3 comments

You've just skipped over the early days of Google where they relied primarily on PageRank and bad actors manipulated it to death.

It's trivial to generate webs of fake, inter-related content and use that specifically to feed incoming links to valuable pages. Or to comment-spam websites so aggressively it ruins them. Or all of the secret deals between high-ranking sites to feed links even though the sites weren't related. There are countless examples of black-hat techniques to break PageRank.
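
As a rough illustration of the link-farm trick, the sketch below runs a simplified power-iteration PageRank on a small hypothetical web, then adds 20 fake pages that all link to one target. The page names, farm size, and damping factor are assumptions made up for the example, not data from any real incident.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Simplified power-iteration PageRank over {page: [outlinks]}."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            # Pages without outlinks spread their rank over every page.
            targets = links.get(page) or list(pages)
            share = damping * rank[page] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical "honest" web in which "target" is the least important page.
honest_web = {
    "hub": ["news", "blog"],
    "news": ["hub"],
    "blog": ["hub", "target"],
    "target": ["hub"],
}

# The same web plus a link farm: 20 fake pages whose only job is to link to "target".
farmed_web = dict(honest_web)
for i in range(20):
    farmed_web[f"farm{i}"] = ["target"]

baseline = pagerank(honest_web)
boosted = pagerank(farmed_web)

print("target before farm:", round(baseline["target"], 3))
print("target after farm: ", round(boosted["target"], 3))
print("blog after farm:   ", round(boosted["blog"], 3))
```

In this toy setup the farm lifts "target" above "blog" and "news", even though no legitimate page changed its links. Real link farms were more elaborate, but the mechanism is the same: the algorithm cannot tell a fake incoming link from a real one.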

I am sorry, but you simply can't build a sustainable search engine without deeply understanding user intent and the meaning behind the indexed pages.

>There are countless examples of black-hat techniques to break PageRank

There are also countless adversarial examples to trick ML algorithms. In fact, this is in many ways worse because of the 'idiot savant' character of ML systems, which are almost always oblivious to context and can be tricked in ways that aren't apparent from the design of the system.

In contrast to systems that are legible or even formally verifiable, ML systems are entirely unable to provide any guarantees. When someone breaks PageRank, at least it's apparent how they broke it. When an ML system mistakes a turtle with a fractal pattern on its shell for a gun, nobody knows how to fix the system in any reliable way, other than to feed it more data and pray.

PageRank worked fine when it was invented. It's a very elegant algorithm. But in a perfect illustration of Goodhart's law, it fell apart once people realized that they could game it to increase their traffic. Google has been in a constant arms race against unscrupulous SEO practices ever since.

>Google has been in a constant arms race against unscrupulous SEO practices ever since.

One company controls 80% of what is found on the internet. They set rules, restrictions, penalties that are not public. They do not pass any sort of regulatory muster. They rip and tear through businesses standing in their way. They crush out a person's online existence through never explained reasons. They use every advantage they can to tweak a human's emotions, drive and needs to feed more and more advertisements.

You suggest those trying to use every advantage they can to rank higher are unscrupulous?

Google's fight to keep search results crisp ended soon after they began selling advertising. Google long ago quit innovating search to be better for people; they've made it better for advertisers.

what is the weather today, Google?

I agree that you don't need NLP to rank webpages (though it certainly helps), but you do need it to parse the kinds of queries given to search engines these days. The days of logical OR and NOT are long gone, I'm afraid.

> It's actually fairly straightforward to build recommender systems on transparent, graph-based algorithms, and it gives you the added advantage of not discriminating in strange ways.

I think other commenters have addressed the PageRank issue, but I'd be super interested in papers doing the work you note above.