| HN Mirror


by simondotau 212 days ago

The more things change, the more they stay the same.

About 10-15 years ago, the scourge I was fighting was social media monitoring services, companies paid by big brands to watch sentiment across forums and other online communities. I was running a very popular and completely free (and ad-free) discussion forum in my spare time, and their scraping was irritating for two reasons. First, they were monetising my community when I wasn’t. Second, their crawlers would hit the servers as hard as they could, creating real load issues. I kept having to beg our hosting sponsor for more capacity.

Once I figured out what was happening, I blocked their user agent. Within a week they were scraping with a generic one. I blocked their IP range; a week later they were back on a different range. So I built a filter that would pseudo-randomly[0] inject company names[1] into forum posts. Then any time I re-identified[2] their bot, I enabled that filter for their requests.

The scraping stopped within two days and never came back.

--

[0] Random but deterministic based on post ID, so the injected text stayed consistent.

[1] I collated a list of around 100 major consumer brands, plus every company name the monitoring services proudly listed as clients on their own websites.

[2] This was back around 2009 or so, so things weren't nearly as sophisticated as they are today, both in terms of bots and anti-bot strategies. One of the most effective tools I remember deploying back then was analysis of all HTTP headers. Bots would spoof a browser UA, but almost none would get the full header set right: things like Accept-Encoding or Accept-Language were either absent or static strings that didn't exactly match what the real browser would ever send.
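The "random but deterministic" injection from footnote [0] could look roughly like the Python sketch below. It is an illustration, not the original filter: the brand list, the `rate` parameter, and the function name `inject_brands` are all assumptions. Seeding the generator from the post ID means a scraper that refetches the same post sees the same injected text.

```python
import hashlib
import random

# Hypothetical brand list (an assumption). The original reportedly used around
# 100 consumer brands plus the monitoring services' published client names.
BRANDS = ["Adidas", "Coca-Cola", "ExampleCorp"]


def inject_brands(post_id: int, text: str, rate: float = 0.1) -> str:
    """Insert brand names after pseudo-randomly chosen words, seeded by post ID."""
    # Derive a stable seed from the post ID so repeat fetches get identical output.
    digest = hashlib.sha256(str(post_id).encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))

    out = []
    for word in text.split():  # note: this collapses runs of whitespace
        out.append(word)
        if rng.random() < rate:
            out.append(rng.choice(BRANDS))
    return " ".join(out)
```

Calling `inject_brands(42, body)` twice returns the same string, while a different post ID gives a different injection pattern.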

5 comments

DamnInteresting 211 days ago

I did something similar with someone who was using my site’s donation form to test huge batches of credit card numbers. I would see hundreds of attempted (and mostly declined) $1 donations start pouring in, and I’d block the IP. A little while later it would restart from another IP. When it became clear they were not giving up easily, I changed tack: instead of blocking them, I would return random success/failure messages at the same rate they were seeing success on previous attempts. I didn’t really try to charge those cards, of course.

I like how this kind of response is very difficult for them to detect when I turn it on, and as a bonus, it pollutes their data. They stopped trying a few days after that.
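A minimal Python sketch of that idea, assuming the earlier real approval rate is known from logs. The class name `PoisonedResponder` and its parameters are hypothetical, and no payment processor is contacted at any point.

```python
import random


class PoisonedResponder:
    """Return made-up approve/decline results at a previously observed rate."""

    def __init__(self, real_successes: int, real_attempts: int, rng=None):
        # Match the success rate the attacker saw before poisoning was enabled.
        self.rate = real_successes / real_attempts if real_attempts else 0.0
        self.rng = rng or random.Random()

    def respond(self) -> str:
        # No card is charged; the outcome is purely random.
        return "approved" if self.rng.random() < self.rate else "declined"
```

Because the fake approval rate matches what the attacker saw earlier, the switch is hard to spot from the response statistics alone.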

lelanthran 211 days ago

Yup. The only real way to stop bots is by convincing the operator that your data is poisoned.

That means you need to poison the data when you detect a bot.

simondotau 211 days ago

Was it always $1? If I were the attacker, surely I’d pick a random number. My guess is that $1 donations would be an outlier in the distribution and therefore easy to spot.

It’s also interesting that merchants (presumably) don’t have a mechanism to flag transactions as being >0% chance of being suspect. Or that you waive any dispute rights.

As a merchant, it would be nice if you could demand the bank verify certain transactions with their customer. If I was a customer, I would want to know that someone tried to use my card numbers to donate to some death metal training school in the Netherlands.

DamnInteresting 210 days ago

They did try adding variations to the amount (+0.50-1.00) late in the game, but by then it was ineffective; I could still quickly detect them and turn on the randomized data poisoning. I expect that they want to keep the amount small so most cardholders won't bother to look into the unfamiliar charge.

I do wonder whether these people sold their list of "verified" credit card numbers to any criminal enterprises before they realized the data was poisoned. That would be potentially awkward for them.

grishka 211 days ago

Thank you very much for the observation about headers. I just looked closer at the bot traffic I'm currently receiving on my small fediverse server and noticed that it uses user agents of old Chrome versions, but also that the Accept-Language header is never set, which is indeed something that no real Chromium browser would do. So I added a rule to my nginx config to return a 403 to these requests. The number of these per second seems to have started declining.

grishka 211 days ago

It's been a few hours. These particular bots have completely stopped. There are still some bot-looking requests in the log, with a newer-version Chrome UA on both Mac and Windows, but there aren't nearly as many of them.

Config snippet for anyone interested:

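    # Start empty so nginx does not warn about an uninitialized variable.
    set $block "";
    # Flag UAs that claim a Chrome build with a full four-part version number.
    if ($http_user_agent ~* "Chrome/\d{2,3}\.\d+\.\d{2,}\.\d{2,}") {
      set $block 1;
    }
    # Append a second flag when Accept-Language is missing or empty.
    if ($http_accept_language = "") {
      set $block "${block}1";
    }
    # Reject only when both flags are set (nginx "if" has no AND operator).
    if ($block = "11") {
      return 403;
    }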

simondotau 211 days ago

The important thing is to be aware of your adversary. If it’s a big network which doesn’t care about you specifically, block away. But if it’s a motivated group interested in your site specifically, then you have to be very careful. The extreme example of the latter is yt-dlp, which continues to work despite YouTube’s best efforts.

For those adversaries, you need to work out a careful balance between deterrence, solving problems (e.g. resource abuse), and your desire to “win”. In extreme cases your best strategy is for your filter to “work” but be broken in hard to detect ways. For example, showing all but the most valuable content. Or spiking the data with just enough rubbish to diminish its value. Or having the content indexes return delayed/stale/incomplete data.

And whatever you do, don’t use incrementing integers. Ask me how I know.

grishka 211 days ago

In my particular case, I don't mind the crawling. It's a fediverse server. There is nothing secret there. All content is available via ActivityPub anyway for anyone to grab. However, these bots specifically violated both robots.txt and rel="nofollow" while hitting endpoints like "log in to like this post" pages tens of times per second. They were just wasting my server's resources for nothing.

simondotau 211 days ago

My base advice is to make sure you have a very efficient code path for login pages. 10 pages per second is nothing if you don’t have to perform any database queries (because you don’t have any authentication token to validate).

Beyond that, look for how the bots are finding new URLs to probe, and don’t give them access to those lists/indexes. In particular, don’t forget about site maps. I use cloudflare rules to restrict my site map to known bots only.

grishka 211 days ago

Of course. My server wasn't struggling with that. I haven't benchmarked that server, but on an M1 Max, the app can easily serve hundreds of requests per second for profile pages, which is the heaviest thing an unauthenticated user can access (I cache a lot in memory, but posts, photos, and friend lists aren't among that). It was just a mild annoyance.

They discovered those URLs simply by parsing pages that contain like buttons. Those do have rel="nofollow" on them, and the URL pattern is disallowed in robots.txt, but I'd be surprised if that'd stop someone who uses thousands of IPs to proxy their requests. I don't have a site map.

AJMaxwell 211 days ago

That's a simple and effective way to block a lot of bots, gonna implement that on my sites. Thanks!

tesin 212 days ago

The vast majority of bots are still failing the header test - we organically arrived at the exact same filtering in 2025. The bots followed the exact same progression too. One IP, lie about the user agent, one ASN, multiple ASNs, then lie about everything and use residential IPs, but still botch the headers.
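A rough Python sketch of a header test along these lines. It assumes, as the comments above suggest, that a real browser always sends non-empty Accept-Language and Accept-Encoding headers; the function name and the `Mozilla/` heuristic are illustrative assumptions, and a production filter would compare far more of the header set.

```python
def looks_like_spoofed_browser(headers: dict[str, str]) -> bool:
    """Flag requests whose UA claims a browser but whose headers look incomplete."""
    # HTTP header names are case-insensitive, so normalise them first.
    h = {name.lower(): value for name, value in headers.items()}

    # Only judge requests that claim to be a browser (assumed heuristic).
    if "Mozilla/" not in h.get("user-agent", ""):
        return False

    # Assumption: genuine browsers send both of these, non-empty.
    required = ("accept-language", "accept-encoding")
    return any(not h.get(name, "").strip() for name in required)
```

A request that openly identifies itself as, say, `curl/8.0` is not flagged here, since it is not pretending to be a browser; what to do with honest bots is a separate policy decision.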

thephyber 211 days ago

In the movie The Imitation Game, the Alan Turing character recognizes that acting 100% of the time gives away to the opposition that you identified them and sets off the next iteration of “cat and mouse”. He comes up with a specific percentage of the time that the Allies should sit on the intelligence and not warn their own people.

If, instead, you only act on a percentage of requests, you can add noise in an insidious way without signaling that you caught them. It will make their job troubleshooting and crafting the next iteration much harder. Also, making the response less predictable is a good idea - throw different HTTP error codes, respond with somewhat inaccurate content, etc.
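Acting on only a fraction of detected requests, with a varied response each time, could be sketched in Python as follows. The status codes, the `act_rate` parameter, and the function name are assumptions chosen for illustration.

```python
import random

# Assumed pool of error statuses to rotate through.
NOISE_STATUSES = [403, 429, 500, 503]


def choose_response(is_bot: bool, act_rate: float, rng: random.Random):
    """Return an HTTP status to send instead of real content, or None to serve normally."""
    # Humans, and a (1 - act_rate) share of detected bots, get the normal page.
    if not is_bot or rng.random() >= act_rate:
        return None
    # Vary the failure so the pattern is harder to reverse-engineer.
    return rng.choice(NOISE_STATUSES)
```

With `act_rate=0.3`, roughly seven in ten bot requests still succeed, which makes it harder for the operator to tell detection from ordinary flakiness.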

wvbdmp 212 days ago

Why do the company names chase away bots? Is it just that you’re destroying their signal because they’re looking for mentions of those brands?

simondotau 211 days ago

It’s both a destruction of signal and an injection of noise. Imagine you worked for Adidas and you started getting a stream of notifications about your brand, and they were all nonsense. This would be an annoyance and harm the reputation of that monitoring service.

They would have received multiple complaints about it from customers, performed an investigation, and ultimately performed a manual excision of the junk data from their system; both the raw scrapes and anywhere it was ingested and processed. This was probably a simple operation, but might not have been if their architecture didn’t account for this vulnerability.

akoboldfrying 212 days ago

I also didn't follow that part. Their step 2 seems to be a general-purpose bot detection strategy that works independently of their step 1 ("randomly mention companies").

SAI_Peregrinus 212 days ago

It spams the bot with false-positives. Encourages the bot admins to denylist the site to protect the bot's signal:noise ratio.

akoboldfrying 211 days ago

That was my first thought too -- but then why would the bot company care about a few false positives?

I suppose it could have an impact if 30% of all, say, Coca Cola mentions on the web came from that site, but then it would have to be a very big site. I don't think the bot company would notice, let alone care, if it was 0.01% of the mentions.

rvba 211 days ago

They don't want to feed their model with garbage data, or this data is read and reviewed by real humans.

I remember years ago (2008?) I worked in a company where every mention of it was manually reviewed by someone from the PR department. I imagine now the tools are even better.

A different matter is that discussion is often very low quality (forums died for multiple reasons, reddit is dying too - astroturf galore now).

simondotau 211 days ago

Everyone’s definition of “big” is different, but back then it was big enough to get its own little island in a far corner of XKCD 802.

https://xkcd.com/802/

dotancohen 211 days ago

Diaspora?