Hacker News
by holografix 1428 days ago
There’s been some debate about this here on HN and someone made a point that resonated with me:

The quality of _the web_ has been in decline lately.

With ML’s capacity to paraphrase original content and to generate plausible rubbish content from scratch, it’s very difficult for Google’s PageRank (or whatever they call their algo these days) to fight back.

That being said, there does seem to be a fair bit of scraping and pasting going on. I’m surprised Google is not looking at published dates and lowering the scammers’ ranking.
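As a rough illustration of the published-date idea, here is a minimal Python sketch. It assumes each page carries a trustworthy `published` date and a base `score`. Both fields, the `penalty` value, and the example URLs are made up for illustration, and this is not a description of how Google actually ranks anything. Pages whose normalized text is identical are grouped, and every copy except the earliest has its score reduced. Self-reported dates can be faked, so a real system would need a more reliable signal, such as the date a page was first crawled.

```python
from collections import defaultdict
from datetime import date
import re


def normalize(text):
    """Lowercase and keep only alphanumeric words, so trivial edits do not hide a copy."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def demote_later_copies(pages, penalty=0.5):
    """Return {url: adjusted_score}.

    pages: list of dicts with 'url', 'text', 'published' (date), 'score' (float).
    Within each group of pages with identical normalized text, only the
    earliest-published page keeps its full score (hypothetical policy).
    """
    groups = defaultdict(list)
    for page in pages:
        groups[normalize(page["text"])].append(page)

    adjusted = {}
    for group in groups.values():
        earliest = min(group, key=lambda p: p["published"])
        for page in group:
            factor = 1.0 if page is earliest else penalty
            adjusted[page["url"]] = page["score"] * factor
    return adjusted


# Illustrative data only.
pages = [
    {"url": "https://original.example/post", "text": "How to prune roses.",
     "published": date(2015, 3, 1), "score": 1.0},
    {"url": "https://scraper.example/copy", "text": "how to prune  ROSES",
     "published": date(2021, 6, 9), "score": 1.2},
    {"url": "https://other.example/page", "text": "Baking sourdough at home.",
     "published": date(2019, 1, 5), "score": 0.8},
]
adjusted = demote_later_copies(pages)
```

In this toy data the scraper starts with the higher score but ends up below the original once the later copy is penalized.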

3 comments

Also, the quality of hyperlinks has declined precipitously. Google's big insight was that what people say about a page matters more than what the page says itself. Hyperlinks were fundamental to the (pre-2013) ranking algorithm, in ways that went far beyond just PageRank. This worked really well when people authored HTML pages by hand (or in FrontPage or Dreamweaver) and would promiscuously link out whenever another site was relevant.

It works really poorly when all the links are paid for, or bargained for, or part of social media sites that are overrun with spammers and use rel=nofollow anyway, or are internal-only because every site wants to be its own walled garden. That's the web we've got now.
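To make the link-based idea concrete, here is a small sketch of the textbook PageRank iteration in Python, with links marked nofollow simply left out of the graph. It is the published algorithm in its simplest form, not Google's production ranking. The site names, the damping factor of 0.85, and the decision to ignore nofollow links entirely are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links: iterable of (source, target, nofollow) tuples.
    Links with nofollow=True pass no rank (an assumption of this sketch).
    """
    links = list(links)
    nodes = sorted({n for s, t, _ in links for n in (s, t)})
    out = {n: [] for n in nodes}
    for source, target, nofollow in links:
        if not nofollow:
            out[source].append(target)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            if out[n]:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[n] / len(out[n])
                for target in out[n]:
                    new[target] += share
            else:
                # No counted out-links: spread the rank evenly over all pages.
                share = damping * rank[n] / len(nodes)
                for target in nodes:
                    new[target] += share
        rank = new
    return rank


# Hypothetical mini-web: two hand-made sites link to "guide";
# "social" links to "spam" only through a nofollow link.
links = [
    ("blog", "guide", False),
    ("forum", "guide", False),
    ("guide", "blog", False),
    ("social", "spam", True),
]
rank = pagerank(links)
```

Here "guide" ends up with more rank than "spam", because the only link pointing at "spam" is nofollow and therefore passes nothing. The flip side is the problem described above: when most real links are nofollow, paid, or internal-only, the graph has little genuine signal left to work with.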

There's also the fact that fewer people are maintaining their own websites or blogs. The content is now in places that can't be crawled as easily, like Instagram, TikTok, and Twitter, as well as Facebook groups.
The loss of blogs is so huge in all sorts of ways.

Some random person's travel blog in 2008 would be full of genuinely useful information. That's gone now, replaced by regular people just posting pics to social media and by affiliate-link-filled, SEO-optimized travel sites, half the time written by someone who hasn't even gone to the place but is just copying info and adding sponsored content.

The same is true for essentially everything: gardening, video games, books, bikes, bird watching, baking. The genuine amateur enthusiast content isn't published on the open web in accessible text. It's locked in walled gardens or just never created in the first place, with people choosing to post a couple of pictures or videos rather than write a blog post about it.

I used to ridicule GeoCities and Tripod, but I sure miss all those niche sites now.
Look at the Python documentation. There are a metric ton of spam sites that copy and paste the Python documentation with ads inserted. Google ranks these higher than the primary source.

I suspect Google is boosting ad-supported content over non-ad-supported content, directly incentivizing paraphrased/copypasta content.

I see this with questions and answers from Reddit or other forums which get syndicated into various other 'developer' sites and get high rankings on Google.

Search engines should let us configure a whitelist of sites for certain categories/contexts of search.
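A per-category whitelist could be approximated client-side today by filtering result URLs. The Python sketch below assumes results arrive as a plain list of URLs. The category names, the `WHITELIST` mapping, and the example mirror domains are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical user configuration: search category -> trusted domains.
WHITELIST = {
    "python": {"docs.python.org", "peps.python.org"},
}


def host_allowed(host, allowed):
    """True if host equals a whitelisted domain or is a subdomain of one."""
    host = host.lower()
    return any(host == d or host.endswith("." + d) for d in allowed)


def filter_results(urls, category, whitelist=WHITELIST):
    """Keep only URLs on whitelisted domains; unknown categories pass through unfiltered."""
    allowed = whitelist.get(category)
    if allowed is None:
        return list(urls)
    return [u for u in urls if host_allowed(urlparse(u).hostname or "", allowed)]


results = [
    "https://docs.python.org/3/library/re.html",
    "https://python-docs-mirror.example/re.html",
    "https://docs.python.org.mirror.example/re.html",
]
kept = filter_results(results, "python")
```

Matching on the parsed hostname rather than on a substring of the URL matters here: the third result contains "docs.python.org" in its name but is a different site, and it is dropped.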

Which can be so frustrating when a bad answer gets propagated this way.

I've had times where the same "bad" information (whether completely wrong, incomplete, misleading, not best-practice, confusing, whatever manner of "bad") showed up on multiple different sites all on the first page of Google, often clearly copied from one another or from the same original Quora/Stack Overflow/whatever.

I'm pretty sure that Google's search ranking places sites with Google ads HIGHER than the same site without any ads. Of course they would: it gives them better metrics to show their Google ads are effective, and they can charge more.
I think it's plausible that there are types of spam Google can't fight against, but this ought to be possible (which supports the "malice" theory).
Is it the quality of the web that has been declining, or is our ability to find quality information on the web (due to Google and friends) affecting our perception of the web?

I'm on the fence on this.

Contributors and commenters on HN manage to surface many interesting sites that would be difficult to find using a modern search engine. There is also a large number of old sites that have continued to exist, even if they are otherwise unmaintained, which are also difficult to find using a modern search engine. On the other hand, these sites may be less common than they used to be.

Likewise, scraped and pasted websites are nothing new. It has always been a relatively low effort way to post content. What seems to be new is how often nearly identical pages appear in the top search results. This could be because it is more common, but it may also be because the algorithms are favouring very particular types of content.