Hacker News
by TLLtchvL8KZ 1581 days ago
I come across these sites so often it's not even funny.

Different website. Different title. Exact same content. 4 or 5 in the first page of search results.

I'm assuming they're all run by the same person, throwing as much ** at the wall knowing some will stick.

Many of my searches now include "reddit" or "forum" at the end to filter out all the spam/crap.

5 comments

This feels like the underlying issue. Google may have stayed the same, or even slightly improved.

But the web, in the sense of quality:crap ratio, has gotten substantially worse.

This flood seems like the ultimate manifestation of turnkey hosting solutions.

Imho, we could do worse than reviving an idea from email's early days vs spam: negligible per-use charging. The idea was to tax emails at $0.0001 (or somesuch). Insignificant for actual users, but financially decimates high-volume, low-value spammers.
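
For a rough sense of scale, here is a back-of-the-envelope sketch of that per-use charge. Only the $0.0001 fee comes from the comment above; the daily volumes (100 emails for an ordinary user, 10 million for a bulk sender) are assumptions picked purely for illustration.

```python
from decimal import Decimal

# Fee taken from the comment above; everything else here is an assumption.
FEE_PER_MESSAGE = Decimal("0.0001")  # dollars per message


def yearly_cost(messages_per_day: int, days: int = 365) -> Decimal:
    """Total fee in dollars for sending at a steady daily rate."""
    return FEE_PER_MESSAGE * messages_per_day * days


ordinary_user = yearly_cost(100)          # assumed volume: 100 messages/day
bulk_sender = yearly_cost(10_000_000)     # assumed volume: 10 million messages/day

print(f"ordinary user: ${ordinary_user:.2f} per year")
print(f"bulk sender:   ${bulk_sender:,.2f} per year")
```

Under those assumed volumes the ordinary user pays a few dollars a year while the bulk sender pays hundreds of thousands, which is the asymmetry the idea relies on.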

The web is like that because content farms are optimizing the pages to be found by Google and Google doesn't know how to filter them out, so we really can't treat it as a problem independent of Google itself.

It isn't so much that 'Google doesn't know how to filter them out' as that there's nothing left after having filtered them out.

Nobody is producing real content that isn't behind a paywall.

There's nothing to find.

There is real content being drowned out by autogenerated SEO crap. I tried looking for rice cookers that don't have non-stick coatings. They do exist, and there are some blog posts that talk about them, but the top results are all stores that have just a generic rice cooker category and generate 200 duplicate pages with the title changed to exactly match whatever your search term is. So the page says "Ceramic rice cooker" but shows their generic listing of PTFE rice cookers.

Google search is constantly improving but the SEO spammers are improving faster.

I know this isn't true because I've used google for nearly two decades with good results. That information hasn't evaporated, it's just buried.

Fair. The fact that Google exists + the fact that Google serves a huge amount of traffic + the fact that Google is unable / unwilling to filter out content farms = incentive to content farm.

If there were no Google though, we'd likely have the same thing.

So I guess the only reality that avoids incentivizing them is one where (1) there is a massive traffic generator & (2) that massive traffic generator severely disincentivizes content farms.

In theory proof-of-work with increasing cost based on subjective untrustworthiness might work.
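
A minimal sketch of what that could look like, assuming a hashcash-style scheme where the publisher must find a nonce whose SHA-256 digest starts with some number of zero bits. The distrust score in [0, 1], the base of 8 bits, and the 12 extra bits are all hypothetical parameters, and how anyone would assign the score is left open, as it is in the comment.

```python
import hashlib
from itertools import count


def difficulty_bits(distrust: float, base_bits: int = 8, max_extra_bits: int = 12) -> int:
    """Map a distrust score in [0, 1] to a required number of leading zero bits."""
    distrust = min(max(distrust, 0.0), 1.0)  # clamp out-of-range scores
    return base_bits + round(distrust * max_extra_bits)


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the front of a digest."""
    return len(digest) * 8 - int.from_bytes(digest, "big").bit_length()


def digest_for(payload: bytes, nonce: int) -> bytes:
    return hashlib.sha256(payload + str(nonce).encode()).digest()


def solve(payload: bytes, bits: int) -> int:
    """Brute-force a nonce; expected work doubles with every extra bit."""
    for nonce in count():
        if leading_zero_bits(digest_for(payload, nonce)) >= bits:
            return nonce


def verify(payload: bytes, nonce: int, bits: int) -> bool:
    """Checking costs a single hash, however hard the puzzle was to solve."""
    return leading_zero_bits(digest_for(payload, nonce)) >= bits


page = b"https://example.com/some-page"   # hypothetical page identifier
bits = difficulty_bits(0.25)              # mildly distrusted publisher
nonce = solve(page, bits)
print(bits, nonce, verify(page, nonce, bits))
```

With these assumed parameters a fully trusted publisher needs about 2^8 hashes per page and a fully distrusted one about 2^20, so the cost gap is roughly 4096x while verification stays cheap.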

This has been happening a lot with StackOverflow and GitHub pages lately. A lot of the time, the actual GitHub or SO link won't even be on the first page.

I'm surprised they haven't done some kind of manual pruning of junk like that, or maybe they have and it's not working... but on the surface it totally seems like they could implement something that says "GitHub has content X, and these other 10 sites are 99% the same, but we've flagged GitHub as an authoritative source so they'll always outrank the clones".

Maybe it's a fear of appearing unfair. Or maybe they secretly want to hurt Microsoft by turning a blind eye. Or maybe this is actually a much harder problem. If I had to guess it's probably #3. But as a user of search it's frustrating to find the clones ranked above the real stuff.
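
To make the "99% the same, but GitHub is flagged as authoritative" idea concrete, here is a toy sketch. It assumes word-level 3-shingles with Jaccard similarity as the duplicate test and a hand-maintained allowlist of hosts; the threshold, the allowlist, and the function names are all made up for illustration, and nothing here reflects how any real search engine ranks results.

```python
AUTHORITATIVE_HOSTS = {"github.com", "stackoverflow.com"}  # hypothetical allowlist


def shingles(text: str, k: int = 3) -> set:
    """Set of overlapping k-word tuples from the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def rerank(results: list, threshold: float = 0.9) -> list:
    """results: (host, text) pairs in their original rank order.

    Near-duplicates are grouped with the first page they resemble, and
    inside each group the authoritative hosts are moved to the front.
    """
    clusters = []  # each entry: (shingles of first member, list of members)
    for host, text in results:
        sig = shingles(text)
        for representative, members in clusters:
            if jaccard(sig, representative) >= threshold:
                members.append((host, text))
                break
        else:
            clusters.append((sig, [(host, text)]))

    ranked = []
    for _, members in clusters:
        # Stable sort: False (authoritative) sorts before True (everything else).
        members.sort(key=lambda r: r[0] not in AUTHORITATIVE_HOSTS)
        ranked.extend(members)
    return ranked


answer = "how to reverse a list in python use the reversed builtin or slicing"
results = [
    ("clone-a.example", answer),
    ("github.com", answer),
    ("other.example", "an unrelated page about ceramic rice cookers and their lids"),
]
print([host for host, _ in rerank(results)])
```

Comparing every page against every cluster is quadratic, so this only shows the logic; at web scale something like MinHash or SimHash would be needed, which may be part of why this is "a much harder problem".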

Can’t they just look at where they first encountered the copied content?
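
A rough sketch of that first-seen idea, assuming the crawler keeps a log of (URL, page text, time first crawled). The log format, the URLs, and the function names are hypothetical. One caveat is built into the approach: crawl time is not publication time, so a clone that happens to be crawled before the original would win.

```python
import hashlib
from datetime import datetime


def fingerprint(text: str) -> str:
    """Hash of the text after lowercasing and collapsing whitespace.

    This only catches exact copies modulo case and spacing, not rewrites.
    """
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()


def pick_originals(crawl_log) -> dict:
    """Map each content fingerprint to the URL where it was first seen."""
    earliest = {}
    for url, text, first_seen in crawl_log:
        key = fingerprint(text)
        if key not in earliest or first_seen < earliest[key][1]:
            earliest[key] = (url, first_seen)
    return {key: url for key, (url, _) in earliest.items()}


crawl_log = [
    ("https://lifehacks.example/copy", "An  Obscure Legal Topic, explained", datetime(2021, 6, 1)),
    ("https://original.example/article", "An obscure legal topic, explained", datetime(2021, 1, 1)),
]
print(list(pick_originals(crawl_log).values()))
```

The usage example mirrors the stolen-article case: the reformatted copy differs only in case and spacing, so it lands on the same fingerprint and the earlier crawl date decides.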

Yup, just found this morning that an article my wife wrote on a very obscure legal topic was stolen, reformatted, and posted on some "life hacks" sort of site. It shows up #3 in the DDG results. At least her originals are still #1 and #2.

Meanwhile I have in my inbox in the last 24h at least a half-dozen emails looking to do SEO work for my company website.

Web = untrustworthy? YUP

I'd happily pay for a serious version of 1999 Google, but updated to filter out anything advert based, and search for exactly what I want.

Search is such a fundamental function, and we've done the experiment and the advert model fails - it needs to be just another utility.

"Different website. Different title. Exact same content. 4 or 5 in the first page of search results."

If only google was smart enough to figure this out

If you put site:reddit.com in your query, Google will only return results from reddit.com
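
For anyone scripting this, a small sketch of building such a query URL. The helper name and the search terms are made up; the only parts taken from the comment are the site: operator and reddit.com.

```python
from urllib.parse import urlencode


def site_search_url(terms: str, site: str) -> str:
    """Build a Google search URL restricted to one site via the site: operator."""
    query = f"{terms} site:{site}"
    return "https://www.google.com/search?" + urlencode({"q": query})


print(site_search_url("ceramic rice cooker", "reddit.com"))
# https://www.google.com/search?q=ceramic+rice+cooker+site%3Areddit.com
```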

Not even this is a guarantee. This happens to me regularly:

https://twitter.com/jdgoesmarching/status/149367886211437772...

Interesting! I get all Reddit results on my laptop (Safari), but on my Android (Edge and Chrome) I get Good Housekeeping and NYTimes before Reddit. Some kind of ad? Though when I click on the three dots next to the result, it says it's not an ad. Odd.

Any Googlers want to chime in?

Yes, there's a particular bug that sometimes happens on mobile with some queries. It's relatively recent; we're aware and working to resolve it, because site: really should only show content from the indicated site.

I’ve had it for at least six months, almost always around product-related queries.