Hacker News
by denysvitali 11 days ago
Cloudflare is known to use fingerprinting to detect scrapers. For example, they use JA3 fingerprints and match them against the UA to block stuff like cURL while allowing OkHttp (Android clients) - but this can easily be spoofed with packages such as CycleTLS [1].
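
For readers who haven't met JA3: as publicly documented, it is an MD5 hash over five comma-separated ClientHello fields (TLS version, cipher suites, extensions, elliptic curves, point formats), each list joined with dashes. The sketch below only illustrates that construction. The field values are made up, and how Cloudflare actually compares fingerprints with the User-Agent is not public, so treat this as an assumption-laden illustration.

```python
import hashlib

def ja3_fingerprint(tls_version, ciphers, extensions, curves, point_formats):
    """Build a JA3 string from ClientHello fields; return (string, md5 hex digest)."""
    fields = [
        str(tls_version),
        "-".join(str(c) for c in ciphers),
        "-".join(str(e) for e in extensions),
        "-".join(str(g) for g in curves),
        "-".join(str(p) for p in point_formats),
    ]
    ja3_string = ",".join(fields)
    return ja3_string, hashlib.md5(ja3_string.encode("ascii")).hexdigest()

# Hypothetical ClientHello values, not captured from any real client.
client_a = ja3_fingerprint(771, [4865, 4866, 49195], [0, 10, 11], [29, 23], [0])
# Same cipher suites in a different order: the hash changes.
client_b = ja3_fingerprint(771, [49195, 4865, 4866], [0, 10, 11], [29, 23], [0])
print(client_a[0])
print(client_a[1] != client_b[1])
```

Because the hash depends only on what the client puts in its ClientHello, a tool that controls those fields can reproduce another client's fingerprint, which is the spoofing mentioned above.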

I don't want to defend them, because they gate away a good chunk of the internet with their "bot protection", but unless you do PoW (which is also ecologically a nightmare), probably fingerprinting is the way to go - completely destroying the privacy of everyone involved.

Cromite, a privacy-conscious fork of Chromium for Android, constantly has issues with Cloudflare Turnstile [2] because they (Cloudflare) try to fingerprint the browser in multiple ways before letting it pass the challenge. The only way to get it to work would be to join the Cloudflare Browser Developer program - which requires signing an NDA. Rightfully so, the project maintainer didn't want to do it.

If you want to see the extent of what Cloudflare does to fingerprint browsers, just have a look at the issue [2] and see which flags need to be disabled in order for the browser to pass Cloudflare's challenge.

I understand both sides, but at least CloudFlare could be flexible enough to fall back to PoW instead of just blocking people from sending forms or accessing websites...

[1]: https://github.com/Danny-Dasilva/CycleTLS

[2]: https://github.com/uazo/cromite/issues/2365

17 comments

Fingerprinting for "bot protection" is indistinguishable from fingerprinting for mass surveillance.
Sure, this is the age-old “knife used to cut steak is indistinguishable from knife used to stab people” thing.

Tools are inherently amoral; only people can have motives we can celebrate or condemn.

Is the value provided by Cloudflare to the public so great that we are willing to pay for it by enabling mass surveillance?
> Is the value provided by Cloudflare to the public so great

turnstile is not a public good, it's a private product, promoted to private entities that want to achieve a certain outcome that is beneficial to them privately.

The mass surveillance is a side-effect - an externality that cloudflare does not have to pay for (but we as netizens pay collectively).

It is the role and responsibility of gov't to regulate away externalities (or make those who benefit from them pay a cost somehow, to equalize said externalities). Unfortunately, like with climate change, nothing has been forthcoming, and only a few people care about the actual damage enough to even talk about it.

So it will go on, and the masses do not have a say.

That’s a better question. So far the answer seems to be yes.

Large companies and banks see >95% fraud on sign in / sign up flows. It’s a constant battle and the law of large numbers says even a tiny false negative rate can be catastrophic.

A bogus GCP or AWS or Azure account costs those companies hundreds to thousands of dollars. I don’t know what the average loss is on fraudulent bank signins, but probably on that order. And there are millions, sometimes billions of attempts per day.

I worked at a tech company that used an off-brand, truly awful captcha provider. Think “drag the mammal to the habitat it lives in, avoiding the wiggly lines”. When this awful provider went down (frequently), we fell back to recaptcha. Fraud rates were 100x higher in those minutes-to-hours outages. Though of course real users were also able to get in at higher rates.

Considering how much of the internet is already trying to track me? Yeah, Cloudflare provides more than enough value.

It's pretty clear that this is being done to solve an actual problem that they and their customers have. I'd prefer if it wasn't necessary, but I'll take this over solving challenges any day.

the toolmakers and merchants aren't inherently amoral, though. if you're making kitchen knives to sell at Crate & Barrel you probably sleep soundly. if you're filing down shivs to sell to street gangs, they're probably not using them to cut steak, and you know that.

so as a toolmaker (presumably) you still have to answer for what you do.

I guess then you wouldn't mind if I cut you in order to verify you indeed aren't a steak, would you?
Joke’s on you, I am literally made of meat.
Talking about mass surveillance: after taking the usual measures against cross-site browser tracking, who knows most about my website visits? Meta, Google or Cloudflare? Blocking me from site visits when fingerprinting is shut off forces all my traffic back into the CF funnel. The number of affected websites is soaring. Try it yourself with https://sereneblue.github.io/chameleon and https://github.com/kkapsner/CanvasBlocker/ and you're increasingly locked out.
And incentives mean those doing the former will also do the latter.
I would like my browser to not pass their challenge and then flood the support of services I cannot reach. This is the only way for them to stop, to really get on the nerves of their customers.

Those might ignore it, but there are always alternatives.

They'll just tell you to clear cookies and use Chrome.
That is delusional. Nobody is getting on anyone’s nerves, materially. The people who care about this are a rounding error of a rounding error.
It’s always amusing when someone brings up the “just tell banks that if they reduce account takeovers by 80%, it will drive off 3 customers a year” argument (and those are the same 3 customers who call site support to complain the website doesn’t work well on their homebrew Chromium when running on BSD).

Cloudflare only exists in its current form because banks and such already enthusiastically accepted that trade off.

I don't think it is delusional. Maybe ineffective. I think it is delusional to just accept these privacy invading measures as inevitable. Especially for software services today, there often is an alternative.

Most businesses don't have the luxury to be able to not care for the customer.

Cloudflare is a service provider for third parties, not the product I want to consume.

> but unless you do PoW (which is also ecologically a nightmare)

Can you expand? I don't see a problem with some napkin math. 5W load for 2 seconds is about 0.0028Wh (we have to let smartphones pass and not by doing PoW for tens of seconds). 8 billion checks a day for a year ≈ 8GWh.
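
The napkin math above can be reproduced in a few lines. This is only a sketch using the numbers from the comment (5 W, 2 seconds, 8 billion checks a day); the function name is made up for illustration.

```python
def pow_energy(load_watts, seconds_per_check, checks_per_day, days=365):
    """Return (Wh per check, GWh per year) for a proof-of-work challenge."""
    wh_per_check = load_watts * seconds_per_check / 3600  # watt-seconds -> watt-hours
    total_wh = wh_per_check * checks_per_day * days
    return wh_per_check, total_wh / 1e9  # Wh -> GWh

wh_per_check, gwh_per_year = pow_energy(5, 2, 8e9)
print(f"{wh_per_check:.4f} Wh per check, {gwh_per_year:.1f} GWh per year")
# prints "0.0028 Wh per check, 8.1 GWh per year"
```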

I stand corrected. It's not a nightmare scenario (as for Bitcoins) - but I'm still of the idea that "useless" computations should be avoided (as we should avoid having 10MB websites).

In any case, according to some napkin math done by Kimi 2.6 (which by itself is probably already consuming more than all of my PoW challenges for the upcoming 5 years) - the situation looks incredibly in favor of PoW: https://www.kimi.com/share/19e7ef40-a432-8912-8000-0000b4a71...

Which makes me wonder why CloudFlare isn't switching to this already

There's a saying that if an idea is stupid, but it works, it's not stupid.

If some computation is "useless" but it serves its purpose, it's not useless.

The reason why the Bitcoin network expends so much energy is down to tokenomics, not the system of PoW itself. At equilibrium we expect the power usage to be (blocks/hr) x (BTC/block) x ($/BTC) x (kWh/$), so it's a function of the BTC price and emission rate.
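
To make the units in that product concrete, here is a small sketch. It takes the comment's equilibrium assumption at face value (miners spend all block revenue on electricity), and every input value below is an illustrative placeholder, not market data.

```python
def equilibrium_power_kw(blocks_per_hour, btc_per_block, usd_per_btc, kwh_per_usd):
    """(blocks/hr) x (BTC/block) x ($/BTC) x (kWh/$) = kWh per hour, i.e. kW."""
    return blocks_per_hour * btc_per_block * usd_per_btc * kwh_per_usd

# Placeholder inputs chosen only to show how the factors combine.
base = equilibrium_power_kw(6, 3.125, 100_000, 10)
doubled_price = equilibrium_power_kw(6, 3.125, 200_000, 10)
print(base, doubled_price / base)
```

Doubling the price term doubles the result, which is the point being made: the power draw follows price and emission rate, not anything intrinsic to PoW.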

PoW in other contexts has way different driving factors. In this case, the marginal improvement of fetching the site again for AI bots isn't enough to cover the PoW cost. The PoW cost is outweighed by the net bandwidth cost of all the parties.

I mean coal power plants work, so building new ones is not stupid by that standard.

I think we have to expand the definition of stupid to include things that work but have net negative externalities. Not sure where PoW falls in that way of looking at things, but we should at least consider it.

(Thinking about it, Captcha is PoW, just with the work theoretically done by the human)

Necroing this, but perhaps you might be interested in some sort of BOINC-like PoW scheme for websites. BOINC is a distributed computing platform that grew out of SETI@home. It's not really practical for cryptocurrency PoW applications (despite its use in Gridcoin) due to the centralized nature of the challenge-response, but certainly more useful than captchas or hashes!
Because it doesn’t solve the problem of residential botnets.
The botnet operators will be incentivized to mine bitcoin instead of whatever they are doing.
Neither does fingerprinting.
The goal of Cloudflare’s fingerprinting is to detect whether a user agent appears to be a legitimate human. It’s not to identify human users across websites.
That is not a good excuse for requiring overly complicated and overly specific software.
Why not? PoW challenge doesn't whitelist botnets. If the dumb scraper makes only get requests and doesn't solve the challenge, it doesn't matter how it connects, even if it's a perfectly hidden tor exit node.
Because the work would be done by the compromised residential device. No botnet owner is going to care if their 100,000 rooted routers have to do a little more work. It’s still “free” from their perspective.
If botnet owner allows RCE, the botnet will just change the owner.
Because you can't have both a difficulty with a reasonable page load time and a difficulty that stops bad actors. Attackers have stronger machines and are willing to wait as long as they need to.
8 billion checks per day sounds on the low end. I can imagine it being ten or a hundred times more. That still seems pretty fine though. On the other hand, it's hard to see that such a modest energy cost would dissuade any attacks.
> I can imagine it being ten or a hundred times more

I don't think I average even 2 captchas a day being terminally online, so 10 across every soul in the world sounds way too much for me. (we're ignoring bots it's meant to deter?)

> it's hard to see that such a modest energy cost would dissuade any attacks.

It's not against targeted attacks, but scrapping.

And not about energy cost, but available compute power -- it requires scrapper to use browser with JS (or time commitment to reimplement PoW outside of JS), limits their request rate by CPU core count.
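
A hashcash-style sketch of what such a challenge could look like. This is not Cloudflare's or any specific product's scheme; the challenge string, the nonce encoding, and the 12-bit difficulty are all assumptions picked for illustration. The asymmetry is the point: solving takes many hashes, verifying takes one.

```python
import hashlib
import itertools

def leading_zero_bits(digest: bytes) -> int:
    """Number of leading zero bits in a digest."""
    value = int.from_bytes(digest, "big")
    return len(digest) * 8 - value.bit_length()

def check(challenge: bytes, nonce: int, difficulty_bits: int) -> bool:
    """Server side: one hash to verify a submitted nonce."""
    digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
    return leading_zero_bits(digest) >= difficulty_bits

def solve(challenge: bytes, difficulty_bits: int) -> int:
    """Client side: brute-force the smallest nonce that passes the check."""
    for nonce in itertools.count():
        if check(challenge, nonce, difficulty_bits):
            return nonce

challenge = b"example-challenge"  # hypothetical per-request challenge
difficulty = 12                   # about 2**12 hashes expected per solution
nonce = solve(challenge, difficulty)
print(nonce, check(challenge, nonce, difficulty))
```

Each extra difficulty bit doubles the expected work, so a client's request rate ends up bounded by how many hashes its cores can do per second.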

> I don't think I average even 2 captchas a day being terminally online, so 10 across every soul in the world sounds way too much for me. (we're ignoring bots it's meant to deter?)

You're mixing up checks, fingerprinting, and PoW with a captcha being triggered because those didn't pass. The less abnormal your setup is, the fewer captchas you'll get.

I agree with the rest of what you said.

Also I think you mean "scraper" and not "scrapper".

>probably fingerprinting is the way to go - completely destroying the privacy of everyone involved

your doctor seeing you naked does not destroy your privacy, it's your doctor sharing the photos with everybody that does. i.e. the problem here is that intermediaries like cloudflare don't work for you, they work for somebody else or sell the data themselves.

Wait, is your doctor taking photos of you naked?
Molescan doctors do - they map the skin, log it, and look for changes over time.
Brave has aggressive fingerprinting protection, I have Auto-Shred (formerly Forgetful Browsing) turned on, I use VPN and yet I rarely get gated out.
A testament to how well Brave protects you from being identified by [Cloudflare in this example]
Not sure what you mean, Brave blows Firefox out of the water in terms of privacy protections. Firefox has milquetoast fingerprint protection and it doesn't even block ads. uBlock is worse than Brave's blocking by virtue of not being natively integrated.
> but unless you do PoW (which is also ecologically a nightmare), probably fingerprinting is the way to go

Only as long as legislation and law enforcement are off the table. Almost like we have those because everyone doing their own policing is not a reasonable way to run a society.

This is why I have two separate browsers. If you want to do official stuff like paying for things you need to get through cloudflare.
You can use Firefox with different profiles and configure it to launch a particular profile directly, without launching the default profile and using about:profiles.

A Firefox non-default profile can be created like this:

  ./firefox -CreateProfile "profile-name /home/user/.mozilla/firefox/profile-dir/"
  # For, say, cloudflare that would be:
  ./firefox -CreateProfile "cloudflare /home/user/.mozilla/firefox/cloudflare/"
And you can launch it like that:

  ./firefox -profile "/home/user/.mozilla/firefox/profile-dir/"
  # For cloudflare that would be:
  ./firefox -profile "/home/user/.mozilla/firefox/cloudflare/"
So, given that /usr/bin/firefox is just a shell script, you can

    - create a copy of it, say, /usr/bin/firefox-cloudflare
    - adjust the relevant line, adding the -profile argument
If you use an icon to run firefox (say, /usr/share/applications/firefox.desktop), you'll need to do the same copy-and-adjust for the icon's .desktop file.

Of course, "./firefox" from examples above should be replaced with the actual path to executable. For default installation of Firefox the path would be in /usr/bin/firefox script.

So, you can have separate profiles for something sensitive/invasive (linkedin, cloudflare, shops, banks, etc.) and then you can have a separate profile for everything else.

And each profile can have its own set of extensions.

They're blocking Firefox quite often. Stripe does something that makes Firefox hang. I use Chrome for those sites and then go back to Firefox...
You can now do this from the `Profiles` menu too, without going down the CLI path. It's extremely simple now.
If that works for you - that's fine.

I'd argue, that for some, CLI path is actually cleaner.

You see, the way described above creates entirely separate points of entry, and you don't have to go to the central menu to launch specific profile.

It eliminates one step (Profile Manager, about:profiles or whatever) allowing you to get faster to the desired profile - same way you'd launch a default profile.

It's logical separation too. It's like separate browsers from UX standpoint (they do use the same distribution though ...unless they aren't - you can configure different distributions for different profiles - nothing stops you from that).

We are not in any kind of disagreement :)

I'm just leaving the information about the GUI option for others who may not be aware that it can be done from the GUI too, and think it's difficult to do in Firefox.

What does profile-switching provide that switching containers within a single profile doesn't?

Edit: I RTFA'd, containers can't adjust `privacy.resistFingerprinting`. Boo

- Independent set of extensions (independently configured) for each profile.

- Independent set of settings/about:config parameters.

With containers you can't turn off, say, WebRTC completely for some of them while allowing it for others; with profiles you can.

Different history. I remember accidentally nuking history of a few years - that wasn't fun. Now, you reduce blast radius.

Proxy on/off or different proxies. Though, there's probably an extension that manages it on per-site basis.

Different userChrome.css, if you fancy that.

>I remember accidentally nuking history of a few years - that wasn't fun.

If you are on Linux I can't recommend enough using a COW filesystem like btrfs or zfs with snapshots. I can't count the number of times I have wiped or edited something by mistake and then restored it within seconds with it.

Except that fingerprinting means that both profiles are actually tied together by cloudflare (and other tech companies)
I think the idea is that they have the functionality that cloudflare is using to generate the fingerprint (like webGL in this case) disabled in their non-cloudflare profile and only use the cloudflare profile to do things they have to that are behind cloudflare
that's why I use completely different browsers with different settings. my CF-friendly one (not my daily driver) is `firejail --private chromium` so it always starts with a clean temporary profile
Firefox added profile switching recently. Works good.

(That said, I still keep separate machines. One for doing "official" things, the other for everything else)

> Firefox added profile switching recently.

I think this was as recent as 25 years ago?

Recently they added some new UI. There was and still is (I think) classic Profile Manager UI, which you can launch with

  ./firefox -ProfileManager
or access UI in about:profiles.

But you don't have to use any of those anyway - see my comment above (a response to parent).

They actually have at least 3 kinds of profile:
1. Containers - as they say, it's some kind of sandbox; technically a profile.
2. Profiles that are accessible through about:profiles, which they've had for years, and probably the ones you are talking about.
3. New profiles that come with a pop-up, much like how Chromium browsers show it.
The old UI was pretty difficult to use, and hard to discover unless you knew where to look though.
What about the old UI is difficult to use? I am assuming you are talking about the profile manager.
Odd - they've had that for years, but only on the command line. Wonder if it's different under the hood? They also have firefox containers which also never quite became a first-class feature (you have to install a plugin).
>Works good.

does it? same binary, same machine, same display, same 781 other heuristics.

Micropayments would be another one, but then governments and banks have to give up ~~financial control & surveillance~~ AML essentially to make it financially viable. AML also has a horrible track record of how much money is spent compared to the amount recovered.
Anti-money laundering laws are a deterrent. If you know moving around $10,000 will be reported to FinCEN, as will any discovered pattern of structuring transactions below $10,000, then you are forced to pursue riskier ways of moving your money than Western Union.
Wouldn't micropayments be aggregated into a money laundering operation by some third party? (Wasn't IIRC even Spotify used for that?) Or would Cloudflare take all the money in this hypothetical scenario?
PoW doesn't fix anything if you have an army of zombie CCTV cameras and smart fridges at your disposal.

It's either proof of humanity (increasingly hard to get in this day and age, particularly if accessibility is a concern), proof of identity (even worse) or proof of system integrity, which is the least bad out of all the terrible options.

Why wouldn't PoW help? If it's tuned so that each device in that army takes 10 seconds instead of 10 milliseconds to make a request, have you not slowed the army down by a factor 1000?
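
The factor-1000 claim is simple throughput arithmetic. A sketch, assuming each device handles one request at a time and using the 100,000-device figure mentioned elsewhere in the thread purely as an example:

```python
def fleet_requests_per_second(devices, ms_per_request):
    """Aggregate request rate if every device issues requests back to back."""
    return devices * 1000 / ms_per_request

without_pow = fleet_requests_per_second(100_000, 10)      # 10 ms per request
with_pow = fleet_requests_per_second(100_000, 10_000)     # 10 s per request
slowdown = without_pow / with_pow
devices_to_compensate = int(100_000 * slowdown)
print(without_pow, with_pow, slowdown, devices_to_compensate)
```

The same numbers show both sides of the argument: the fleet's rate drops by the slowdown factor, and restoring it takes that many times more devices.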
You just need 1000x more zombie fridges, which may be still acceptable for some bad actors.
Sure. But this is kind of vacuously true for any real world DoS scenario. It's like saying "sure, your new weapons system might wipe out 999 out of every 1000 of the enemy's forces, but what does it matter, they can just scale up by a factor 1000 and we're back where we started".

Anything that amplifies the cost and effort required by the adversary by several orders of magnitude is worthwhile discussing.

But more expensive (to get). At the same time, the PoW would require more compute power, so the same device will still be capped at the same rps
Then every normal user has to take 10 seconds as well, which is an awful experience.
Presumably most users visit the site with more compute-capable devices than a fridge. But I do agree that it's sad that such an approach artificially worsens the internet experience for people on older/weaker devices. On the other hand, Cloudflare's Turnstile also significantly weakens the experience for everyone.
They're also anti free speech.
> I don't want to defend them, because they gate away a good chunk of the internet with their "bot protection"

They also gate away a good many people with their "bot protection". I am extremely worried about how so many seem to have outsourced the control over who can access their websites to a company, with no second thoughts whatsoever.

The problem is: what is the alternative? I'm (not) defending them or this practice by any measure, but we all know what happens if you just open your site up without these, especially with AI bots which hammer servers and are in effect a legalized DDoS system. I've hated CAPTCHAs ever since I first encountered them and I can't wait for them to just finally die a permanent death, but I also don't know how we solve the "how do you tell a human from a bot" problem in a way which doesn't require server admins to have extremely beefy servers or similar setups to handle the extra load. I'm not going to do the "there HAS to be a way" thing either because, for all I know, this could just be one of those impossible-to-solve problems.
> we all know what happens if you just open your site up without these, especially with AI bots which hammer servers and are in effect a legalized DDoS system

No, we don't know. I honestly do not understand the problem. I run websites, both static and non-static. Granted, my sites aren't exactly the most popular internet go-to destinations, but I should be seeing this DDoS too, right?

I do see lots of requests. Nothing that any modern system can't handle. Computers are stupid fast these days. Unless you are doing something unreasonable, it's really hard to even notice this "extra load".

I understand there are sites for whom this causes problems, but I think these are rare and could be optimized not to do unreasonable things.

I think too many people are annoyed by AI companies (arguably understandable position), look at their logs and speak of "hammering", "DDoS" and "extra load", while in reality it doesn't matter much.

We do know, just ask anyone who runs a more popular site or does anything where abuse can be monetized (shopping, reviews, etc.). Avoiding that due to obscurity isn’t an answer because it’s saying you’re safe until something, possibly outside of your control, causes the bots to descend and give you an extra 500M requests with no chance of revenue.

I’m with OP: I don’t like this but the alternatives all look like the death of the open web.

> just ask anyone who runs a more popular site

The person you're responding to already said they ran a modestly sized site. What actual scale opens one up to abuse? If only the top 1% of sites need it, then it seems silly to say "everyone" needs it.

It’s not just scale. Do you accept user generated content? If so, more of a target.
So everyone is paying cloudflare… why?
It might depend on the tech stack. I run a small niche website but it has PHP and a database (MediaWiki/PHPBB) and without Cloudflare I'd estimate I'd need to spend several hundred dollars a month to handle the traffic. Traffic used to be tens of thousands of requests a day. AI has increased that to between 400k and 3M requests per day but it's not a smooth distribution. This is with bot fight mode on that greatly reduces traffic.

I adopted Cloudflare because it was getting DDoSed by the AI crawlers. I'm pretty sure all of them are vibe coding their crawlers and don't bother adding rate limiting as a requirement.

That was my point. I was trying to be gentle by mentioning "unreasonable" things, but seriously — how did we get to the point where less than 6 requests per second (that's 500k requests per day) is considered a DDoS?

I've spent some effort on optimizing my sites, but most of the effort was focused on avoiding unreasonable (stupid) work. Do I need a session for every request? No, I don't! Do I need a database fetch for every access to my homepage? No, I don't! Is it a problem to actually load all of my static content in all supported languages (24) into memory and serve it from memory? No, it isn't!

I use Clojure behind nginx on the server for my sites. Oh, and I also pre-compress all static assets to Brotli, so anything that handles brotli gets a static file served directly from nginx. I also use immutable assets with unlimited caching semantics.

Really — the problem is that we've grown lax and our software has become bloated, slow, and with unreasonable code paths. If every page fetch does 12 database accesses and runs through a slow interpreter, that is surely going to be a problem.

That's the traffic after rate limiting controls and bot fight mode. It's 3-4 million requests per day without bot fight mode and just rate limits. And as I said it's not a smooth distribution. Plus the requests are almost never for pages in cache. It's always stuff like loading all the message threads from the year 2000 or loading up the details of every page edit ever made to a wiki page.

If it was more static content it'd be easier, it's really the db being a bottleneck in a dynamic site.

Yes, the software could be better optimized but then I'd have to own the development of it. There is no reason a niche website should be getting millions of requests per day.

I second this. My website exposes a cgit and 99% of the traffic now is AI scraping the sources, but the load is nowhere near DoS territory. And this is running on the cheapest VPS I could find.

Not saying I'm not annoyed by the scraping; I am looking to block them, but I'm also not going to put the site behind the gatekeeper. If anything, Cloudflare must love AI scraping now for the same reason AV companies love malware.

Now, if you are running a PHP stack...yeah, maybe that's the problem right there.

Is there actually any plausible theory why "AI" would repeatedly scrape the same sites? Are there that many competing, completely independent AI labs? Is it cheaper to repeatedly scrape than to buffer the scraped data locally? (I find it very hard to imagine that it's easier to deal with changing/disappearing content than it is to stand up such a cache.)
If you ask an agent to check sources / function definitions of open source packages it will wget / curl it
It's an AI generated scraper that scrapes nonstop.
> 99% of the traffic now is AI scraping the sources

I wonder if we should stop fighting this and instead create an API specifically for this purpose? Or, a central repository that you could send your data to and say to anyone wanting to scrape, "save yourself some time and just get my data from this other place"

The thing though is that they are extremely idiotic. They are constantly, recurringly, scanning the same files, I suppose out of FOMO that a line might have changed. I don't know what a special API solves, especially because HTTP already has etags to save you from re-downloading the whole damn file over again. But these bots don't care. The extent to which they don't care is such that, after I temporarily took cgit down for kicks, they'd get 404s and still repeatedly ask for the same files for days on end.
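
For anyone unfamiliar with the etag mechanism mentioned above, here is a server-side sketch of a conditional GET. It is deliberately simplified: real `If-None-Match` headers can carry several tags and weak validators, and deriving the tag from a content hash is just one common choice, assumed here for illustration.

```python
import hashlib
from typing import Optional

def make_etag(body: bytes) -> str:
    """Derive a quoted entity tag from the content (one possible scheme)."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'

def respond(body: bytes, if_none_match: Optional[str]):
    """Return (status, headers, payload) for a GET with an optional If-None-Match."""
    etag = make_etag(body)
    if if_none_match == etag:
        return 304, {"ETag": etag}, b""   # unchanged: no body is sent again
    return 200, {"ETag": etag}, body

source = b"int main(void) { return 0; }\n"
first = respond(source, None)                         # first visit: full download
revisit = respond(source, first[1]["ETag"])           # polite re-check
after_edit = respond(source + b"// changed\n", first[1]["ETag"])
print(first[0], revisit[0], after_edit[0])  # prints "200 304 200"
```

A crawler that stored the tag from its first fetch would pay only for headers on every later visit to an unchanged file.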
The PHP stack isn't even the problem, it's having unauthenticated requests getting past the cache in the first place, something that most sites should be able to prevent.
If you're in any way semi-popular and a decent size, you're gonna get hammered. PortableApps.com was partially offline for weeks due to China-based AI scrapers. You block the useragent, they start hitting you with another one from the same IP in the same way. You block the IP, they switch to another. You block the subnet, they use another. At one point it was nearly a thousand different IPs from around China hammering away. For all intents and purposes, a DDoS. This wasn't a little "extra load", this was load that was thousands of times beyond what our legitimate userbase was using.

And if you're thinking about blocking all of China, while this particular AI bot didn't use them, a bunch of other ones I've encountered use VPNs and hacked clients worldwide.

Consider yourself lucky. But don't let yourself fall into the trap of thinking it's a nonissue for everyone else until it happens to you.

People shouldn't have to be experts or provision a larger server to run a UGC service that can withstand the sort of 30x more traffic I'm seeing from AI bots. Or rather, you didn't render the argument for why they should have to do that if they can just use CloudFlare's free tier.

Either way, it's easy to have all the answers when you've never had the problem.

Has anyone pointed an AI scraper at your server at all? Unless your website appears in search engine listings I don't think the AI scrapers will slam it. My server has never been hit by them but my server is also practically unknown. All of this said, I'm not going to claim that server loads can handle it because many sysadmins have claimed otherwise, and I would like to think that their claims are reliable.
As soon as you get your TLS certificate you get bombarded with scraping. You don't need someone to "point a scraper at you".

What matters most is usually how much there is to scrape. If you have like 5 pages that's nothing. For forum like websites where each thread, each user profile, etc. gets scraped that's when traffic increases. I just let them have at it with no issues though, computers are fast.

That's really weird. My experience is quite different: I have several subdomains and all of them have TLS certs and I haven't (yet) seen this (thankfully). Either that, or my server is masking it. The weird thing is that my server is an OVH dedicated box that doesn't exactly have top-tier specs, so I have no idea what's going on there. Very weird indeed.
If you run the site on a custom port, scrapers won't find it?
Also, how do we even know they're really "AI scrapers", or just a deliberate DDoS to push sites into using CF or other "anti-bot" providers?
They showed up when the AI money did. The evidence is circumstantial, but… some of them are remarkably well engineered (from a “how difficult is it to identify this traffic” perspective), in a way that never existed before (I have been running a quite sizeable site for 8 years, over 200k registered users, and you don’t need to register to use 99% of it).
A small, non-static e-commerce site focused on a single EU country, with proper robots.txt instructions that worked perfectly well in the era of search-engine-only bots, and with rate limiting on an nginx/php-fpm setup, is kinda struggling without CF to handle 15000 requests per 15 minutes, coming from Chrome "users" over IPv6. Best so far was an average server load of 40 in htop on an 8-core server x_x
That's about 16.7 rps. A single guy holding the F5 key on chrome can generate that much traffic and take down your website. That kind of performance was never acceptable.
People will always reframe their request numbers to avoid stating their pitiful requests per second numbers, it's hilarious. "This thing is handling hundreds of thousands of requests per day!" Like cool, you're barely making it double digit requests per second.
> handle 15000 requests per 15 minutes,

that's just ~17 req/sec
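
The conversions thrown around in this subthread, as a quick sketch. The inputs are the figures quoted in the comments; note these are averages, and a poster above points out the traffic is not smoothly distributed, so peaks can be much higher.

```python
def requests_per_second(requests, window_seconds):
    """Average request rate over a time window."""
    return requests / window_seconds

DAY = 24 * 3600
print(round(requests_per_second(15_000, 15 * 60), 1))   # 16.7
print(round(requests_per_second(500_000, DAY), 1))      # 5.8
print(round(requests_per_second(3_000_000, DAY), 1))    # 34.7
```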

That's "cheap VPS running wordpress" level of traffic

Maybe a plain WordPress install. Run something like WooCommerce and install a bunch of plugins to get the functionality that WordPress and WooCommerce should have built-in, and suddenly a cheap VPS can only handle 2 or 3 requests per second.

It's phenomenal how inefficient the WordPress/WooCommerce stack is.

Though the main issue I'm seeing is credit card testing, not scraping.

And I'm ideologically opposed to using a CDN (because it shouldn't be needed for such a small site!) so it's somewhat a self-inflicted problem...

You can calculate traffic stats for a day by IPs/subnets and probably bots will stand out. If they are using IPv6 you can figure out the ASN and block it completely.
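
A sketch of that kind of per-subnet tally. It assumes an access log where the client IP is the first whitespace-separated field, groups IPv4 by /24 and IPv6 by /64 (both prefix lengths are arbitrary choices), and uses documentation-range addresses as sample data. The ASN lookup needs an external database and is left out.

```python
import ipaddress
from collections import Counter

def subnet_key(ip_text, v4_prefix=24, v6_prefix=64):
    """Map an address to the subnet it belongs to, as a string."""
    ip = ipaddress.ip_address(ip_text)
    prefix = v4_prefix if ip.version == 4 else v6_prefix
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))

def top_subnets(log_lines, n=3):
    """Count requests per subnet and return the n busiest."""
    counts = Counter(
        subnet_key(line.split()[0]) for line in log_lines if line.strip()
    )
    return counts.most_common(n)

# Made-up log lines for illustration only.
sample_log = [
    '203.0.113.7 - - "GET /a HTTP/1.1" 200',
    '203.0.113.9 - - "GET /b HTTP/1.1" 200',
    '2001:db8::1 - - "GET /c HTTP/1.1" 200',
    '198.51.100.4 - - "GET /d HTTP/1.1" 200',
]
print(top_subnets(sample_log, 1))
```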
Block out IPv6 and see if that helps.
Why not block all odd v4 addresses while you're at it? I heard that that can reduce scraping volume by 50%!
You get downvoted for these opinions but I agree. Most people that complain that their servers get hammered by AI bots are those that run very unoptimized servers that can only handle like 100 rps. I've never had any issues with any of my moderately optimized websites. A $10 VPS can handle sooo much traffic.
I think people get annoyed when it's suggested they spend time optimising or even re-writing their websites to handle high traffic loads just to cater to AI bots ripping their content.

It's also not always easy to do. I run a small wiki which is fairly optimised, nearly every page manages at least ~3k rps on a small VPS. The only exception is the diff page which is ~150 rps. Optimising that while still giving good output isn't that easy, but the wiki doesn't have many users so that would be fine if it wasn't for the AI bots.

The AI bots ignore robots.txt and were initially hitting the site with ~1k rps crawling every combination. Even that would be manageable as there's currently ~150,000 combinations, except they kept re-crawling the whole lot each day. The server could manage it but it was a massive waste of resources.

They were using residential IPs and only sending 1 request from each IP making it impossible to block. In the end I gave up and put a Cloudflare challenge in front of it. I don't want to use Cloudflare but the alternative is forcing users to login to view diffs or remove them entirely.

What I do is have more strict rate limits for non logged in users. You tell them to log in if they hit the rate limit. For non logged in users, you have a rate limit not just for IP, but also for /24 and /16. Forget about IPv6, IPv4 scarcity is a feature not a bug.
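A minimal sketch of that layered limit for anonymous IPv4 traffic. The per-minute limits, the fixed one-minute window, and the function name are all assumptions for illustration, and the in-memory counters never expire old windows, so this is not production code:

```python
import ipaddress
import time
from collections import defaultdict
from typing import Optional

# Assumed per-minute limits: the wider the scope, the higher the ceiling.
LIMITS = {32: 60, 24: 300, 16: 1500}
WINDOW_SECONDS = 60

_counters = defaultdict(int)  # (subnet, window index) -> request count


def allow_anonymous(ip: str, now: Optional[float] = None) -> bool:
    """Return False if the IP, its /24, or its /16 is over its limit for this window."""
    now = time.time() if now is None else now
    window = int(now // WINDOW_SECONDS)
    keys = {
        prefix: (str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False)), window)
        for prefix in LIMITS
    }
    # Reject if any scope is already at its limit; rejected requests are not counted.
    if any(_counters[keys[prefix]] >= limit for prefix, limit in LIMITS.items()):
        return False
    for key in keys.values():
        _counters[key] += 1
    return True
```

When `allow_anonymous` returns False, the response would be the "please log in" page rather than a hard block, and logged-in users would go through a separate, looser limiter.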
Curious, but how do the bots figure out the combinations? Or do you have links to the diffs from other sites? I assume the diff takes two files in query parameters or something.
There really isn't a good reason for a wiki (or git host) to provide diffs between arbitrary revisions to unauthenticated users. Limit it to diffs compared to previous (which can be cached) and this problem goes away.

In any case, such labyrinths of expensive dynamically generated pages are no excuse for subjecting people requesting the start page to bot checks.

I see many MediaWiki wikis (like the Arch Linux wiki) using Anubis successfully. It can be configured to only act on certain paths.
I managed to solve my scraper problems without optimizing much, but if I had to optimize I think the only option might be "don't use mediawiki" and that's an extremely obnoxious solution. Though maybe I could get there by throttling specific kinds of pages.
Same. Tritium and the blog have done stints on the front page here and high traffic subreddits and that plus bots has never been a problem. UX could be improved through a CDN but even that isn’t worth the trade-off for us at the moment.
> I understand there are sites for whom this causes problems, but I think these are rare and could be optimized not to do unreasonable things.

There are. They're not. They can't (without significant effort)

I don't think it's just privacy, it also increasingly turns the web itself into a walled garden. The end result is that websites can only ever be accessed by "approved" clients - the latest Chrome, Edge, Safari and if you're lucky Firefox - and nothing else.
> and if you're lucky Firefox

I haven't had any problems with Firefox so far. Why do you say this?

That was more a (gloomy) outlook into the future, given Chrome's market dominance and tendency for unilateral actions in web standards.
I haven't ever noticed Cloudflare having any issues on Firefox, so presumably that implies any unilateral actions in web standards have been worked around by CF to provide the service to Firefox as well.
It's already a problem with Firefox + some essential web condom extensions.
I think there's some chance we get a "proof of purchase" system where there is some entity that takes a $10 payment to give out a unique identity token that you need to present to visit most sites. if you have a revocation process for ones used for bad actors, it seems like it would work pretty well.
That's called an IP address. You pay your ISP $50+ every month to get one. Has it worked so far?
If the bad guys also had to pay $50/month/IP it would probably work.

The bad guys don't pay that much. And sometimes the bad guys actually use the IPs of other people (botnets on residential IPs) and don't pay anything at all.

They pay something. You can get a few tens of cents per gigabyte for a voluntary proxy right now. I've never tried it long enough to get a minimum payout, so could be a scam for all I know (or maybe the minimum payout is the scam).

What would stop you offering someone a few tens of cents per GB to borrow any other token barrier you put up?

Except if your country is under sanctions.
> we all know what happens if you just open your site up without these, especially with AI bots which hammer servers and are in effect a legalized DDoS system

So delegalize it. Strip searching everyone to paper over the fact that the societal contract has been broken only delays that.

> AI bots which hammer servers

You can easily calculate which IPs/networks bots are using by looking at where most traffic comes from and who requests a lot of pages at non-human speed.

Each IP address is either from a residential proxy network, or from AWS / GCP / DigitalOcean. And each IP requests at human speed. 1000 of them are an issue though.
If you aggregate over a day, it might become more obvious. Also, datacenter network is a big red flag.

By the way, what's your opinion about running a cryptominer on requests from datacenter and bot IPs?

We have a few dozen websites, from ones doing single-digit Mbit to a few Gbit.

Never needed it. Just put the worst offenders in a penalty bucket and that's usually enough.

The alternative is to not have that one choke point that can be hammered. Decentralize.
I use CF and I don’t enable these anti-bot measures. It’s up to the webmaster.
Anubis is one alternative, kinda sucks that we need to slow down the web for everyone a little bit though.
The most plausible near-term path is probably micropayments embedded invisibly in AI agents. Your agent that has learned what you value and can make a reasonable decision to allow a micropayment for certain content pays on your behalf without requiring a conscious decision each time, eliminating the mental transaction cost problem entirely. It's the mental transaction cost that arguably led to the failure of the micro payment model back in the early 2000s.

Although the cynical part of me says that this will result in malicious actors trying to trick agents into giving out a bunch of micro payments. There are counter defenses that can help detect and compensate for that, but perhaps the best we will be able to do is prompt user with the default agent recommendation.

I can no longer access any website that's "protected" by Cloudflare. As soon as a website enables that stuff… "Shoot, another one bites the dust." I wonder if the website owners realise at all how many actual users they lose by this sort of "protection."
Cloudflare will just tell them that 70% traffic drop is because 70% of their traffic was bots, and everything is working fine, and hey, don't you want to upgrade to a paid plan to block 50% of the remainder? Think about how many bots will be blocked with that upgrade!
Do you really stand by these words?
I'm one of those who have enabled Cloudflare on all of the sites I maintain. Additionally, I added Turnstile on every form.

I know some actual users get blocked. But the amount of spam we get without it, the amount of bot traffic simply overwhelming the server... It is just too much.

Recently I also hard blocked all IPs from China, Singapore, India, Pakistan, Russia and the whole of Africa. Do I want to do it? No. But the amount of bot traffic and corresponding spam is a bigger problem :(

I also always block traffic from China, India, Pakistan, and Russia, after observing that 90%+ of the spam/scanning was coming from those countries.

At least for China, I imagine most of the real humans might use a VPN anyway

> I know some actual users get blocked. But the amount of spam we get without it, the amount of bot traffic simply overwhelming the server... It is just too much.
So why not just shut down the website? Or remove the form entirely? That will ensure that you get no spam, right?

One of the core tenets of system design is Availability. If your service is not available - if your forms are blocking legitimate users - then why are you pretending to have a form submission feature at all? Just to frustrate users?

> One of the core tenets of system design is Availability. If your service is not available

The service won't be available to anybody because of overwhelming unwanted traffic. Now it's available for most potential users. You're speaking econ 101 when everyone else has played out iterated prisoner's dilemmas.

> So why not just shut down the website? Or remove the form entirely? That will ensure that you get no spam, right?

Turns out that people have a tolerance for a non-zero amount of work, but still have a limit.

Suggesting "turn off your website" does not account for the desire to also provide some access.

Treat people who host content as humans, just as we must treat users as humans. There are tradeoffs, suggesting "shut down your website unless you provide access everywhere" is worse on all fronts for everyone.

> There are tradeoffs, suggesting "shut down your website unless you provide access everywhere" is worse on all fronts for everyone.

Maybe, maybe not.

If block-heavy websites shut down entirely, we lose some content, but other content moves to block-minimal sites and the average user might be able to access more.

Also if there's no blocking crutch, and people get pushed into shutdown and are mad about it, they might fight harder for anti-spam technology and legal enforcement, which could improve the situation.

Yea, honest admins block entire regions because spam and bot traffic make it impossible to stay open
>I wonder if the website owners realise at all how many actual users they lose by this sort of "protection."

How many people do you think are browsing with a weird enough config (e.g. a custom browser like OP, or some weird config like Firefox with fingerprinting protection on a Raspberry Pi) to trip Cloudflare's protection?

Well… I know plenty of people in my circle affected by this. Just have a slightly outdated system you simply can't afford to update: it's way too easy to get cut off like this. IMHO, a rather systematic discrimination against poorer people.
I got locked out of some websites by Cloudflare Turnstile on some very standard configurations, like an iPhone on Safari, or a Windows 11 desktop with Firefox or Edge, neither with a VPN on. I never found out why.
it's probably because a scraper farm updated their services to latest, and there was a window where fingerprinting was unable to differentiate.

We had all of our devs' Pixels get blocked, and after talking to CF, it was because the Internet Archive had rebooted their scraping farm, all the devices stampeded and overwhelmed the known bot safeguards, and those tags were added across the board. CF gives sites the tools to tune what is getting blocked; we bumped the sensitivity down to 25 and haven't had many complaints (despite having a very vocal community).

The most common complaint is users' IP address getting blocked because of compromised devices

Does not have to be weird. At least once it happened to me that their strictest settings simply banned something like a major portion of internet users in my country, to the point that if you had FTTH you were likely blocked.

And no, it wasn't due to a country-based block selected by site operator.

There are dozens of us :)

In my experience what really makes it loop every single time though is JShelter. CF doesn't like having your fingerprintable data bits messed with.

There are legitimate uses for non-intrusive, ethical and legal scraping, but some of us have had to resort to extreme measures:

https://roundproxies.com/blog/bypass-bot-detection/

Do you by chance have that installed? I don't use Cloudflare but I am curious if that code can scrape my silly blog? [1] Trying to pick the appropriate article... I'm guessing it can. I don't do the fancy javascript or TLS fingerprint inspections, just some janky hill-billy protections, silly redirects and Antarctic voodoo.

[1] - https://blawg.nochan.net/b/Internet-Crap/20260522-Maybe-AI-B...

I use a plain Firefox on a plain Windows 11 PC on a plain regular mass market ISP in a developed country and I get completely blocked by websites daily.

At least let me complete a "prove you are human" challenge or something, but don't outright ban my IP address?

Weird? I live in Thailand, use Firefox, and get half a dozen CF challenges per day.

It takes very little for CF to consider you "weird".

I took the time to write to one on LinkedIn and they didn't reply
>wonder if the website owners realise at all how many actual users they lose by this sort of "protection."

Yesterday cloudflare blocked me from visiting the MX-Linux site ... including an old browser with -no- protections ...

I have to wonder - assuming these sites are paying CF for this 'service' - are they getting a list of all the rejected IPs?

> with no second thoughts whatsoever

As someone responsible for mitigating card testing "attacks", account harvesting, and DDOS attacks..

It is unfortunate, but the ISP industries(from telco up to transit) and CC industries aren't providing a lot of great options. This idea that people are doing things "without a second thought" is usually false when it comes to businesses.

>I am extremely worried about how so many seem to have outsourced the control over who can access their websites to a company, with no second thoughts whatsoever.

I think the Web is on its last legs, anyway. Generative AI and LLM-instead-of-search has destroyed what little value remained.

Governments too. It's inevitable that the international network will fracture into multiple national networks with heavy filtering at the borders as each country scrambles to impose their laws on it.

I'm glad to have known the true internet before its demise. Truly one of the wonders of humanity.

They sometimes have to comply with legal requests (which I understand), but at the same time they have a huge market share - which means that the internet is becoming less and less decentralized and more in their control. We've seen the effects of that in previous outages...
I think what gives me anxiety about the whole situation is:

1. If X% of the population gets wrongly branded with the scarlet letter B[ot], how do they appeal and get it fixed?

2. How will sites notice and know if their choice of "bot protection" is losing them X% of users/customers/job-seekers etc.? If it's a really robust system, they'll never even see the complaints either...

3. If everyone does detect that something is awry, will it be such a monopoly that there's no choice but to let it happen?

I use a cellphone internet provider, and there have been many sites I couldn't access because of Cloudflare or stupid reCAPTCHA. I know damn well what a bicycle, bus, traffic light or stairs is.
It's just one more facet of the enshittoscene, the era where actual product quality is completely irrelevant. Put it in the same bucket as websites that lag when you scroll, apps that refuse to show you video without a huge play/pause button overlaid in the middle of it that never goes away, and the movie Melania. My hypothesis is that billion-dollar businesses no longer exist to sell things to customers, but only to impress other billionaires to get their investment money.
besides proof-of-work, is there any realistic alternative to fingerprinting?
> I don't want to defend them, because they gate away a good chunk of the internet with their "bot protection", but unless you do PoW (which is also ecologically a nightmare), probably fingerprinting is the way to go - completely destroying the privacy of everyone involved.

Bot protection with fingerprinting is just an illusion. Any client-side signal like this can be spoofed by an above-average person. Fingerprinting is just a way to consolidate the market for the advertising business. Assigning reputation to residential IP addresses and commercial blocks is another approach to achieve the desired result. Providers would be a lot more careful about allowing their IP addresses to be misused; however, it turns out that this would bring down the DDoS business on both sides, attackers and protectors.

Ironically, more often than not it's the same companies that invest in building their own bots and finding ways to stop bots from other companies.

> Bot protection with fingerprinting is just an illusion. Any signals like this which is on client side can be spoofed by an above average person.

At the upper bound, fraud can always be committed by paying real people with real accounts to perform the desired action in a way that is 100% truly indistinguishable from organic. There's fundamentally no actual prevention technique at the limit.

So the entire game is only "increasing the costs until it's not viable ROI", not "holistically prevent", which is why fingerprinting is a relevant technique here.

> entire game is only "increasing the costs until it's not viable ROI", not "holistically prevent", which is why fingerprinting is a relevant technique here.

As per Cloudflare's own report, about 78% of DDoS attacks are at the network layer, where the fingerprinting technique is not useful.

DDOS is done against targets for certain reasons, most businesses are not even viable targets for everyone.

However, letting everyone be fingerprinted on the pretext of solving DDoS is where privacy gets compromised (not much of it is left though). Some search engines did it indirectly by letting people use tag managers for free on their websites and then utilizing the data for their advertising business.

Relatively, the end game is the same; it's just how these companies are approaching it.

Fingerprinting to detect bots seems mostly relevant for things which are not DOS, so that percentage doesn't seem like the relevant one.

Bots manipulate review scores, post link spam to other users, crawl your database that isn't open to crawl, etc.

I mean all bot protection is useless at the end of the day, every time I have to bypass it I can do so in roughly 3 to 5 hours, both 2 years ago and more recently around 1 month ago. 2 years ago it was an absolute joke and only took me 30 minutes.

Well I mean maybe it wasn't useless 2 years ago, but in the age of AI it definitely is.

JA3 fingerprinting is really not a serious deterrent, there are many ways to get around that. curl-impersonate works. You can even just use an actual Chrome instance with the devtools protocol, seems to pass as long as you don't use headless mode.
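To show why it's so spoofable, here is a sketch of how a JA3 hash is built, going by the published JA3 description as I understand it: five ClientHello fields, decimal values joined with "-" inside a field and "," between fields, GREASE values dropped, then MD5. The function names are mine, and parsing a real ClientHello is left out.

```python
import hashlib

# GREASE values (RFC 8701) are random placeholders and are dropped before hashing.
GREASE = {0x0A0A + 0x1010 * i for i in range(16)}


def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Build the comma-separated JA3 string from ClientHello fields (decimal values)."""
    def field(values):
        return "-".join(str(v) for v in values if v not in GREASE)

    return ",".join(
        [str(version), field(ciphers), field(extensions), field(curves), field(point_formats)]
    )


def ja3_hash(version, ciphers, extensions, curves, point_formats):
    """MD5 hex digest of the JA3 string."""
    raw = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(raw.encode("ascii")).hexdigest()


# Illustrative field values, not captured from any particular client.
print(ja3_string(771, [4865, 4866, 4867], [0, 23, 65281], [29, 23, 24], [0]))
print(ja3_hash(771, [4865, 4866, 4867], [0, 23, 65281], [29, 23, 24], [0]))
```

Since the hash depends only on what the client chooses to send and in what order, any client that reproduces a browser's ClientHello byte for byte gets the browser's hash.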

The WebGL fingerprinting thing is cute, too. I guess it'll buy them some time since off-the-shelf solutions are going to probably not handle this well yet. That said, as long as the reward for bypassing turnstile and other anti-bot protections remains high, these things really can't do much. A decently resourced adversary can probably come up with a dozen different approaches to make this less useful. Without really looking into it much, my kneejerk is you could probably tweak Mesa to have deterministically random behavior for whatever edge cases it looks for, but you could also just have lots of different GPU/driver combos to proxy to. The web gets less open, but in an asymmetrical way. If you really have an incentive to keep botting, you'll surely find a way.

The next step is to fully give up and just essentially implement WEI. And then the bot problem disappears?

Nope. Botting will still hold tremendous value, so likely there will be many crafty workarounds and bypasses over time. And there will be countermeasures for those and workarounds for that. Guess we'll start to find out who actually has the resources and incentives to keep botting in this environment.

So what's the real solution? Well the most obvious thing to do would be to make botting less valuable. Can we? I dunno. It may have been a mistake to move so many important things to the Internet after all. I mean, some of this is just threat actors catching up with what's possible and was inevitable to begin with. But, some of it is just trying to find solutions to problems that were unnecessary to begin with. Or failing to implement solutions despite an obvious need to do so.

There are a lot of threads to pull on, here. Account takeover still holds tremendous value to threat actors. Why? In my opinion, it's because passkeys were a tremendous failure, no matter what adoption shows. If we wanted to just improve security for users, I think we didn't need to restructure the internet around another authentication mechanism that of course, provides attestation capabilities, we could've just improved on passwords. For more secure handling of passwords, PAKEs exist. Password managers exist. For anti-phishing, TOTPs exist. What if you could have the exact same passkey experience, but in such a way that everything can gracefully fall back to just passwords and TOTP, because they're the real keymatter at the end of it? Add a web standard that lets browsers and browser extensions hook into the login process, standardize PAKEs as part of the web. Cross-vendor synchronization? A problem easily solved if we ever wanted to.

Instead of that, we got the dumbest possible world. Passkeys are sometimes available, but often not. Can you sync your passkeys across devices? Probably, maybe they have blacklisted KeepassXC by now so maybe I can't :)

But a lot of stuff doesn't even offer me the option to use passkeys, so they still use passwords. Can I enter my password to log in still? No, of course not. See, I will helpfully get the option to enter my password, in addition to the option to use email or SMS, the most secure authentication scheme known to Man, but if I actually select password and enter my secure password from my secure password manager, what I get to find out is that the password option is actually password and email or SMS and there's no option to use TOTP. Oh, and you randomly get logged out for no reason sometimes.

Some of the bots will probably disappear. Like, whatever bot is throwing me several terabytes of nonsense traffic every month will probably eventually disappear since they're wasting so much bandwidth on doing literally nothing. I have no idea what the point is, but I know it can't be terribly valuable for them, and it's not terribly expensive for me. I'd love to know who the hell is doing that and why, though.

But since the web is run mostly by crap companies like Google, it will never get its shit together, and we will get solutions like WEI and identity verification to solve problems that were entirely manufactured (or caused by a significant lack thereof) in the first place.

It's completely fucked.

By virtue of incompetent and ignorant devs and middle managers. Or by virtue of greed and maliciousness.

Yeah yeah never attribute to malice what can be explained by stupidity... This time no. It's both.

it's all for nothing, because Cloudflare's scraping protection works about as well as a $5 padlock - good enough to dissuade bored teens, not good enough to dissuade even an amateur burglar. if someone wants to scrape your publicly visible data, they will. there's nothing you can do.
At the same time: it sure works well enough to annoy anyone with a "bad ASN" IP with 80 captchas a day.
exactly that's what I was thinking... like the day they provided a solution to the issue they posed
It's how I remember I've left my VPN on
Exactly. I’m constantly amazed at how little you actually need to bypass CF, Amazon, Azure WAFs and so on (Incapsula springs to mind too). When you look at the code you’ve come up with, it’s actually quite small and compact.

More to the point, these systems actually help scraping because proof of work unlocks essentially unlimited scraping, in my experience.

That said - from my experience on the other side, sure you can’t stop people like me or you, but you can stop 99% of the others. That’s more than worth it operationally.

What do you mean by ~"PoW unlocks unlimited scraping"?
Usually after you solve the POW challenge, sites let you make a lot of requests before asking you to complete another.
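For reference, a hashcash-style sketch of what such a challenge boils down to. This is a generic illustration, not Cloudflare's or Anubis's actual scheme, and the 16-bit difficulty and function names are assumptions:

```python
import hashlib
import itertools
import os


def leading_zero_bits(digest: bytes) -> int:
    """Number of zero bits at the start of the digest."""
    value = int.from_bytes(digest, "big")
    return len(digest) * 8 - value.bit_length()


def solve(challenge: bytes, difficulty: int) -> int:
    """Client side: brute-force a nonce until the hash has enough leading zero bits."""
    for nonce in itertools.count():
        digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce


def verify(challenge: bytes, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash is enough to check the client's work."""
    digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
    return leading_zero_bits(digest) >= difficulty


challenge = os.urandom(16)    # issued by the server per visitor
nonce = solve(challenge, 16)  # roughly 65,000 hashes on average
print(verify(challenge, nonce, 16))  # True
```

The cost is paid once per challenge. If the site then hands out a cookie that stays valid for thousands of requests, the per-request cost to a scraper is close to zero.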
> Cloudflare's scraping protection works about as well as a $5 padlock

It sure seems to keep me, the casual visitor, far away from just about any site they "protect". I have zero desire to alter my browsing configuration or use extra tools to get around turnstile, I'd rather not even visit the site in the first place.

>, I'd rather not even visit the site in the first place

Until your bank, airline, and tax ministry start using them.

Even more reason to boycott sites using it now.
I vote with my wallet and dump misbehaving banks.
The overwhelming majority of customers don't even know they can care. And most of them wouldn't anyway. So your vote doesn't matter to anyone but you, sadly.
"Misbehaving" by protecting themselves
If you're willing to do it, a real browser with playwright is enough.
Playwright isn't sufficient for all cases.
Not for high volumes of data.
It is if you're willing to pay the extra overhead. ex: Google and MS both use rendered pages for advanced scraping.
$5 padlocks work against what most website owners care about: the common consumer who is using a different app and seeing their site content with someone else's ads on top of it.
> I don't want to defend them, because they gate away a good chunk of the internet with their "bot protection", but unless you do PoW (which is also ecologically a nightmare), probably fingerprinting is the way to go - completely destroying the privacy of everyone involved.

I hate what the anti-scraper mechanisms have become but it really is the lesser evil. The alternative for many small operators is to just completely shut down.