Hacker News
by Quenhus 1432 days ago
Here is my uBlock filter with hundreds of GitHub/StackOverflow copycats: https://github.com/quenhus/uBlock-Origin-dev-filter

It blocks copycats and hides them from multiple search engines. You may also use the list with uBlacklist.
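
For anyone curious what a domain blocklist does in principle, here is a minimal Python sketch. It is not how uBlock Origin or uBlacklist are implemented; the function names and the `copycat.example` domain are made up for illustration. It only shows the idea of dropping results whose hostname, or any parent domain of it, is on the list.

```python
from urllib.parse import urlsplit


def is_blocked(url: str, blocked_domains: set[str]) -> bool:
    """Return True if the URL's host, or any parent domain of it, is blocked."""
    host = (urlsplit(url).hostname or "").lower()
    parts = host.split(".")
    # Check "www.copycat.example", then "copycat.example", then "example".
    return any(".".join(parts[i:]) in blocked_domains for i in range(len(parts)))


def filter_results(urls: list[str], blocked_domains: set[str]) -> list[str]:
    """Keep only the result URLs that are not on the blocklist."""
    return [u for u in urls if not is_blocked(u, blocked_domains)]


# Hypothetical blocklist entry; real lists contain hundreds of domains.
blocked = {"copycat.example"}
results = [
    "https://stackoverflow.com/questions/1",
    "https://www.copycat.example/questions/1",
    "https://notcopycat.example/x",
]
kept = filter_results(results, blocked)
```

Matching on whole domain labels rather than substrings means `notcopycat.example` is not caught by an entry for `copycat.example`.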

4 comments

With these two pieces of data:

* the identical text copied from stack overflow should be easily identifiable

* volunteers put together a list of these sites themselves

it should be obvious to Google apologists that Google is either negligent or intentionally allowing these sites in its search results. I'm sick of hearing about how "the world is different" and it's an "arms race" between spam sites and Google. Bullshit.

> the identical text copied from stack overflow should be easily identifiable

Google starts matching content from SO => spammers start tweaking the text slightly => Google implements some expensive similarity score to down-rank copycat sites => spammers use more complex scrambling => ...
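
To make "similarity score" concrete, here is a rough Python sketch of one classic approach: word shingles compared with Jaccard similarity. This is an assumption about what such a score could look like, not a claim about what Google actually runs, and the function names and sample sentences are invented for the example.

```python
import re


def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Return the set of k-word shingles, ignoring case and punctuation."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Text shorter than k words: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of the two texts' shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


original = "You can reverse a list in Python by calling the reverse method or by slicing it"
tweaked = "You can reverse a list in Python by calling the reverse function or by slicing it"
unrelated = "The weather in Paris is mild during the spring months of the year"

score_tweaked = jaccard(original, tweaked)
score_unrelated = jaccard(original, unrelated)
```

Swapping a single word only disturbs the shingles that contain it, so the tweaked copy still scores far above an unrelated text. Heavier scrambling lowers the score, which is the next step of the arms race described above.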

> volunteers put together a list of these sites themselves

These lists only work because they're used by a tiny minority of people. If Google were to do this the spammers would start switching domains more quickly (or find some other workaround).

I'm no Google apologist but I think you're underestimating how hard search ranking is when spammers are actively trying to game the system.

> tweaking the text slightly

That's what ML is perfect at detecting, which is Google's forte.

Some of these sites have been returned as top results for a while, so are you suggesting that Google just gave up because spammers would be able to evade them with an update?

Yes, it is an arms race, but Google has far more resources than spammers do, so it should be ahead easily.

You underestimate the resources Google has at its disposal.

They simply don't care because there is no real competition to worry about. Even with this spam you are still likely to use Google, so why would a profit-motivated company bother?

SO seems to have Yahoo ads, so I guess it is a no-brainer for Google to rank sites it profits from over the content the lusers want.
This is the real answer.
The problem with these theories is that they lack any sensible explanation of motive. Google intentionally degrading its search results because they "earn more if the user has to search again and again" just doesn't feel right: even if it were true in some short-term experiment, it would compromise the way people at Google think of themselves and their work to a degree that would be devastating to the company. There is no way they would throw away that sort of value without being under intense pressure, which they definitely are not.
Another comment stated that SO uses ads from someone other than Google, while the copy-paste sites use Google for ads. If true, that is a clear monetary incentive to not go after this too hard.
They've also demonstrated that they can derank the Wikipedia clones. Funny how that ability is lost when the site in question makes money for a competitor.
These large tech companies have a long and varied history of stupid short-term decision making for profit and bad products due to local individual failures. Until there is a clear and detailed explanation of how the spam sites are avoiding Google's wrath, the explanation of stupidity or short-term thinking on Google's part seems just as plausible.
Well, come up with an explanation of how these entirely mechanically generated SO clone sites, with no obfuscation, are allowed to exist by Google, when identifying them and removing them should be fairly trivial?

At the very least they're being deliberately neglectful: they don't feel the bad experience harms their revenue, because there's no other substantial competitor and they can abuse their monopoly status.

I guess they may just not care enough about software developers and figure we're mostly using ad blockers, so it's wasted effort and we'll develop blocklists ourselves. With no monetary value that they can assign to the ill will that it engenders, they figure it must not matter, so they don't bother. Pissing off a large chunk of the entire IT community via obvious neglect seems like a poor move to me, but then I've never felt that I'm cut out for management.

Maybe the problem is just genuinely hard and beyond their capabilities.
Detecting identical snippets of text is beyond virtually no one's abilities.
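
As a rough illustration of how little is needed for the exact-copy case, here is a Python sketch that hashes normalized paragraphs from a source site and measures what fraction of a page's paragraphs match. The names and sample paragraphs are hypothetical, and a real crawler would face scale and normalization problems this ignores.

```python
import hashlib
import re


def fingerprint(paragraph: str) -> str:
    """Hash a paragraph after collapsing whitespace and lowercasing it."""
    normalized = re.sub(r"\s+", " ", paragraph).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def build_index(source_paragraphs: list[str]) -> set[str]:
    """Fingerprint every non-empty paragraph of the original source."""
    return {fingerprint(p) for p in source_paragraphs if p.strip()}


def copied_fraction(page_paragraphs: list[str], index: set[str]) -> float:
    """Fraction of a page's non-empty paragraphs that appear in the index."""
    paragraphs = [p for p in page_paragraphs if p.strip()]
    if not paragraphs:
        return 0.0
    hits = sum(1 for p in paragraphs if fingerprint(p) in index)
    return hits / len(paragraphs)


# Made-up sample data standing in for an answer and a scraped copy of it.
index = build_index([
    "Use list.reverse() to reverse in place.",
    "Slicing with [::-1] returns a new list.",
])
page = [
    "use  list.reverse() to reverse in place. ",
    "Slicing with [::-1] returns a new list.",
    "Click here for more answers",
]
fraction = copied_fraction(page, index)
```

This only catches verbatim copies up to whitespace and case, which is exactly the unobfuscated situation being complained about here.
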
Yeah, I subbed to the blocklist that someone else published that they're maintaining manually. Google certainly has the resources to beat that bar.

It feels like, economy-wide, decision makers in corporations and governments have just arrived at the conclusion that there's no money / no point in trying to stop scammers (and there might be an actual cost to revenue of doing so). It won't goose their quarterly numbers and might hurt them, so it's better to allow it.

This even works on Firefox Nightly on Android. Thanks a lot!
This is fantastic! This is exactly what I needed, thanks!
You rock. Thank you.