Hacker News
by MicahKV 1498 days ago
So spammers have latched onto your search engine because they are getting useful results. They are able to systematically discover websites built on certain platforms that allow users to post content containing links, which they can target for link spam. It is very difficult to fight this on a technical level because there is an entire industry built around blackhat SEO, with all kinds of software and services dedicated to thwarting your defensive efforts. Even Google struggles to keep up with this.

However, they are also systematically feeding you their footprint lists. I imagine you could put together a footprint blacklist pretty quickly, and just stop returning results for any obvious spam queries like those containing "powered by wordpress".

It's not a very elegant solution I'll admit. It won't stop the bots from trying, and you may have to circle back periodically to add new footprints as they surface. But it's a potentially quick and easy way to stop rewarding their efforts, and the blackhat world is pretty used to burning out their resources so hopefully they will figure out it's a dead end and move on.
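A minimal sketch of what such a footprint blacklist could look like, in Python. Only "powered by wordpress" comes from the comment above; the other phrases and the `search`/`backend` names are hypothetical placeholders, and a real list would be built from the spam queries actually seen in the logs.

```python
import re

# Hypothetical footprint list. Only the first entry is taken from the thread;
# the rest are illustrative placeholders.
FOOTPRINTS = [
    "powered by wordpress",
    "leave a reply",
    "post a comment",
]


def normalize(query: str) -> str:
    """Lowercase, drop double quotes, and collapse whitespace so trivial variations still match."""
    query = query.lower().replace('"', " ")
    return re.sub(r"\s+", " ", query).strip()


def is_footprint_query(query: str, footprints=FOOTPRINTS) -> bool:
    """Return True if the normalized query contains any known footprint phrase."""
    q = normalize(query)
    return any(fp in q for fp in footprints)


def search(query: str, backend):
    """Return no results for footprint queries; otherwise defer to the real search backend."""
    if is_footprint_query(query):
        return []
    return backend(query)
```

Substring matching is deliberately crude: it is cheap to maintain as new footprints surface, at the price of occasionally blocking a legitimate query that happens to contain one of the phrases.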

7 comments

> So spammers have latched onto your search engine because they are getting useful results.

I'm not sure about this. At least with my search engine, it doesn't really seem to matter what response they get. I don't even think they look at the responses. They keep hammering away with tens of thousands of queries per day even though they've seen nothing but HTTP status 403 since last October or so.

My best guess is they're going after search engines in general in case they forward queries to google, in order to manipulate their typeahead suggestions.

Put a CloudFlare web application firewall at the front of the site and then use its rate limiting / CAPTCHA features to throttle traffic. It is the easiest way to get rid of parasitic scraping and API abuse. Cost is $0.
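For readers unfamiliar with what rate limiting means in practice, here is a generic token-bucket sketch in Python. It is not CloudFlare's implementation or API; the class name, parameters, and the injectable clock are assumptions made for illustration.

```python
import time


class TokenBucket:
    """Generic token bucket: allows bursts up to `capacity`, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, capacity: int, now=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.now = now  # injectable clock, so the limiter can be tested without waiting
        self.tokens = float(capacity)
        self.last = now()

    def allow(self) -> bool:
        """Refill according to elapsed time, then spend one token if one is available."""
        current = self.now()
        elapsed = current - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = current
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

In a real deployment there would typically be one bucket per client key (for example per IP address), and a rejected request would get a CAPTCHA challenge or an error response instead of search results.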
Yeah, that's essentially what I've done, except I'm paying for their cheapest non-free tier to have a bit more control over it. I really wish I didn't have to route all my traffic through an untrusted 3rd party like that, but I guess we can't have nice things on the Internet anymore.
> I guess we can't have nice things on the Internet anymore

Not since it left the larval stage and became "pay for play", no.

Oh, well, those taxpayer-funded years were nice for those of us who were around.

I think I remember wondering, after the dotcom bust, if the whole web thing would actually take off.

The reasoning I vaguely remember reading was that the internet required government subsidy to exist - at first directly, then in the form of universities, and the bust was a sign that it couldn't exist without one.

I don't remember how prevalent the view was at the time though. Obviously it turned out to be wrong.

Putting authentication on the site would be easier.

There's a rub here, in that people expect to search things without being logged in. But then if you don't log people in, anyone can come calling, including bots. This then causes you to do things like get a third party to filter the data, which then affects the users by having to reroute their traffic to someone else to get rid of some of the visits you don't want from the bots.

And round and round.

Simple authentication to the site with tokens might solve the problem. If an IP comes calling without authentication or payment, then hang the connection.

> Cost is $0.

Cost is the slow enclosure of the internet by a handful of giant companies and, once attestation is universal, having anyone without a locked down device be locked out of most of the internet without providing endless free labour.

FTFY

Huh, well I guess there goes my theory about the incentive. What a bummer. I would have thought that at least with search engine scraping, they would stop expending the effort once the results dried up.
Or put those query results behind an anti-bot/"captcha" test.
That would probably help, but it's also a continuation of the cat and mouse game. There are plenty of captcha breaking services out there, and it only costs about $1 to programmatically solve 1000 captchas.
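To put that price in perspective, a back-of-the-envelope sketch in Python. The $1 per 1000 figure is from the comment above; the 50,000 queries per day is an assumed stand-in for the "tens of thousands of queries per day" mentioned earlier in the thread, and the function name is made up for illustration.

```python
def captcha_cost_per_day(queries_per_day: int, usd_per_1000: float = 1.0) -> float:
    """Daily cost in USD of paying a solving service for one captcha per query."""
    return queries_per_day / 1000 * usd_per_1000


# Assumed volume: 50,000 queries per day, one captcha per query.
daily = captcha_cost_per_day(50_000)
print(daily)  # 50.0
```

Under those assumptions the bot operator would be paying about $50 per day, which is small but not zero.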
> There are plenty of captcha breaking services out there

Give it a try and see what happens.

People said greylisting against email spam wouldn't work, since spammers would just resend. It has worked for 20 years. To get your IP off the DNSBL NiX Spam you just have to follow a link. People said spammers would automate that process. Never happened in 19 years. Sometimes spammers are just lazy.

Sure, but it increases friction that forces a re-eval of cost/benefit of the bot(s).

The newest captcha services return a prediction score rather than a verification screen, and you can feed polluting data to the clients you are certain are bots.

Agreed. I suspect that this is an arbitrage game on the part of the SEO spammers. Each search is cheaper for them than it is for a competitor who's using a major search engine with more extensive anti-spammer protections, and that difference equals $$$. A captcha doesn't have to be an unbeatable solution. It just has to provide enough of a barrier to equalize the cost.
I'm not so sure about this. The spammer's goal is to build up as big a list of link spam targets as possible. If one spammer chooses to only scrape minor engines and another only major engines, the one scraping the major engines will probably come out on top despite the higher cost. Whoever is abusing OP's search engine is likely doing it to supplement the data they are already scraping from the major engines.

For OP, I think simply not returning results at all is a more practical measure because it removes the reward completely. Captchas and bot detection keep the reward in play, while taking away the results entirely makes the entire pursuit futile.

It might be a better idea to return low quality results than nothing at all. The idea is that it's pretty obvious when the bot is banned when it receives no results at all. Having to look at the results manually to determine whether one is banned is a much more time consuming endeavor.
Deliberately feeding the spam bots into an endless loop of captchas might slowly drain their accounts if they are paying 3rd party captcha farms.
As I understand it, the main point of CAPTCHAs isn’t to keep out bots completely, but to give enough friction to make automated attacks or uses infeasible, while keeping the friction low enough that normal users can still use it normally.
... and there are the "click farms" with human beings.
If someone pays people to collect data, you could outright sell the data to them.
Captcha breaking is SO easy these days; even the modern captchas are easy to defeat.
How about serving bots with one link per page, and taking a minute to serve each page? Would this impact their efficiency?
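A rough sketch of that tarpit idea in Python, assuming the site already has some way to flag a client as a suspected bot. The function names, the `is_suspected_bot` flag, and the default page size are all hypothetical.

```python
import time


def paginate_for_client(results, is_suspected_bot: bool, normal_page_size: int = 10):
    """Split results into pages: one link per page for suspected bots, normal pages otherwise."""
    page_size = 1 if is_suspected_bot else normal_page_size
    return [results[i:i + page_size] for i in range(0, len(results), page_size)]


def serve_page(page, is_suspected_bot: bool, bot_delay_seconds: float = 60.0, sleep=time.sleep):
    """Return the page, but stall first when the client is a suspected bot."""
    if is_suspected_bot:
        sleep(bot_delay_seconds)  # blocking; a real server would use a non-blocking delay
    return page
```

At one link per minute, a single connection yields at most 24 × 60 = 1440 links per day, so the bot would need many parallel connections to keep its previous throughput. The blocking sleep here also ties up a server worker for the duration, so the delay costs the site something too unless it is implemented asynchronously.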
Considering that as of Mar 12, this search engine only has 1001 sites indexed, I am not sure how useful this site is for getting SEO backlinks. Speaking of which, are backlinks still a thing these days?
They are, but the useful ones are those coming from sites with higher domain authority rankings.

That’s why you'll see fluff pieces (aka, paid content) from online publications like Forbes for the better funded entities.

Another approach is to reach out to site operators with offers of writing content or asking them to link to your site’s content in their existing content.

It’s expensive and/or incredibly time consuming to get back links that matter.

If the confidence was high enough, perhaps return garbage data?
> It is very difficult to fight this on a technical level

It is when your base assumption is that you won't hire outside of engineering. There are more bored teenagers with phones than people creating quality content, so I'm not sure why you wouldn't just brute force checks against bad actors.

just to throw out ideas: What if he decided to charge for each search, say 1 cent or so? Users could purchase them in bulk, say 100 searches for $1.

The world is getting more and more desperate for a better search engine. The day may come when people are willing to pay for better results.

what is the end goal here? i understand it's about making money somewhere down the road. but how?