Hacker News new | ask | show | jobs
by Kalium 2684 days ago
This might surprise you, but it actually has to do with what traffic coming out of TOR looks like. Well in excess of 90% of traffic coming out of TOR is spam, bots, malicious, or some combination!

Google isn't going out of their way to punish you for trying to protect your privacy. They're trying to stop unwanted traffic. By unfortunate happenstance, you appear to be disguising yourself in the exact same way a shocking amount of bad traffic is.

16 comments

Not just for Tor.

I use Firefox with a few basic extensions (Privacy badger, uBlock, Google Container) yet every time I am presented with having to pick out traffic lights over and over and over again. I usually have about 5 or 6 "challenges" before I give up and use another site.

My timezone has not changed, my IP address and rough location has not changed, my screensize has not changed, my broadband speed has not changed, and my general computer dexterity has not changed, yet I am relentlessly targeted. On chrome I never saw these challenges, but on firefox with the privacy plug-ins I am always always always challenged.

At this stage I think the only signal it is using is "is there a google cookie in this browser? and if so has the google cookie got some 'normal' looking activity logged against it?" I.e. they are checking their server-side logs for a given cookie ID and seeing if that looks normal or not (i.e. seen on google search, seen on youtube, seen ads from a variety of third parties on various different sites, mixed up with time of day and speed of viewing etc etc).

Since I have got Google in a container in Firefox, I am guessing that my google cookie is not present when the captcha loads (due to the containers and privacy badger et al) so there is no identity back in the mothership to compare me against.

for google, you are the enemy. not even bots.

captcha is google master blow against ad blockers.

a regular user, who they have all the info, give them dollars per ad impression. You, with your doNotTrack (ha! that was a joke) and privacy addons makes them only cents per ad impressions.

you are google's enemy. remember this when you get stuck in captcha hell (and consequently censored from most sites until changing device/ip)

IDK. I run Firefox on many OSes, everywhere with uMatrix that blocks known trackers, ad networks and such. I don't see most ads (if any).

I rarely see the "I am not a robot" box, and hasn't seen image recognition tasks for a long-long time.

That also heavily depends on what kind of/which sites you visit.
"that also depends if you have something to hide" was said of every police state and censorship scheme.
if you were really blocking all trackers, Captcha would even work. Firefox help page for their new tracker blocking feature says so even.
They're on a lot of sites that I frequent.
Yup. It's insufferable. Even on sites where I'm a paying customer, I have to go through captcha garbage.
If you're a paying customer, complain to the company. Let them know their site is annoying and frustrating to use because of this.

If they lose enough customers over this, they will probably remove the captcha.

I think quora over states what Google looks at by a wide margin, just try to access a captcha in incognito, they won't have access to as much info as they do on you and yet you're still presented with the same level of captcha (if not more of them, which is to be expected)
Sometimes just checking the checkbox is enough. Sometimes you need to identify cars and store fronts. I think the better Google knows who you are, the more likely just the checkbox is going to be enough. If you go incognito, you have to train their neural nets, if you give up your privacy, you get in for free.

The clever part from Google's perspective is that you have to trade one of these things to Google in order to get access to sites that do not belong to Google at all. Google convinced site owners to have their users pay a tax to Google.

There are many services out there that can solve Google's recaptcha for fraction's of a penny. When someone puts one up, they can make things more expensive, and perhaps sometimes uneconomical, but in general, the cost is low (~$2.00 for 1,000 recaptchas).

When someone uses a recaptcha, they should think about why they are doing so. It's one thing to use it to save a business model, but it's another to use it to protect information that should be free anyway. The elephant in the room is government data. Many government agencies think that selling their data can be a nice source of side revenue, and a recaptcha is a good way of enforcing it. In reality, they just increase the costs for everyone, and those with means can obtain the information while those without means cannot.

Governments need to release their data, freely, without captchas or fees for single users and bulk users, no exceptions.

I've actually been pleasantly surprised at how much data /is/ available, and how much of it is available through common formats like Socrata Open Data API (for use with tools like https://github.com/xmunoz/sodapy)

The counter argument is that they do a great job with trivial stuff like registered dog's names, and less well with sensitive/important issues like policing.

What's the right way to leverage the platform developed for the first into the second?

> Governments need to release their data, freely

Totally agree. Fortunately the Dutch government is trying to make as much data open as they reasonably can, and regularly organise events to encourage developers to use their open APIs.

> My timezone has not changed, my IP address and rough location has not changed, my screensize has not changed, my broadband speed has not changed, and my general computer dexterity has not changed, yet I am relentlessly targeted. On chrome I never saw these challenges, but on firefox with the privacy plug-ins I am always always always challenged.

That's because Google isn't just profiling "Tor users". They're going after anyone who values privacy in any way or technology.

Simply put, you're being punished for ensuring privacy. And anybody who uses Google's captcha services is an accessory to that.

There is no Google "punishment algorithm". It's just computers being dumb.
Somebody made those computers dumb in that exact way. That's the complaint.
I think that Google is more than happy to punish people for protecting their privacy. That may or may not be the main goal, but it doesn't appear to be something Google considers a downside.
Sometimes people intentionally make computers dumb.
Same thing happens to me, same extensions involved, mostly browse incognito. I bet your suspicion is spot on.
I use chrome with Privacy Badger + uBlock Origin and I have to solve the CAPTCHAs every single fucking time. I even have to solve them multiple times. At this point I just leave a page if they have one of those captchas.
>This might surprise you, but it actually has to do with what traffic coming out of TOR looks like.

That's a massive load of bullshit. Google has a captcha challenge that only humans can solve. That alone is already sufficient to prevent unwanted traffic. That is how every captcha system works. However google is an exception. If you're logged in to a google account or are using chrome then google can use that information to track your captcha history. Privacy minded people avoid google like the plague and therefore they cannot be tracked.

>Google isn't going out of their way to punish you for trying to protect your privacy. Except this is exactly what happens. It's not "unfortunate". It works like this by design.

If google cannot track you then the captcha will force you to do something that no other captcha system does: give you even more challenges even if you have solved them correctly. You will spend the next 5 minutes solving captchas correctly and then at the end it will tell you you've failed. This again is unique to google: correct answers lead to failure. The problem immediately goes away if you let google track you, it doesn't matter how bot infested the network is. No other captcha system does it this way.

Google is clearly doing this to get free labour to label their datasets, force people to have a google account and encourage them to use chrome.

If you are using TOR, and not accepting cookies, they are going to have no way of knowing that you are the same user who just solved the CAPTCHA. Every request is going to appear to be from a new user.

If you do everything you can to prevent google from knowing who you are, don't be surprised when they behave like they don't know who you are.

Tor Browser accepts session cookies. It won't have an established google identity, but it fully supports a temporary "solved the captcha" identity.
> The problem immediately goes away if you let google track you

I took that to mean they were blocking cookies

What prevents a botnet from sharing that same session cookie?
A botnet doesn't need Tor in the first place. And you can limit the use of a single captcha solution. It's not much different from the problem of a legitimate google account being borrowed by a bot.
The tile fade-in is also egregious. The only reason that exists is to punish humans.
Well, it punishes bots in an equal amount, in that the bots have to wait longer before they can retry.
It punishes humans more than computers because computers are more efficient multitaskers. A computer can find a productive way to use the second between each tile fade in, but a human has no realistic way to productively use that second. The human sits there staring at the screen waiting, while the captcha-solving computer does other things (perhaps solve other captchas given to it through other connections.)
Slight nitpick but past captcha successes are a characteristic of cyborg accounts, which still act as a bot most of the time.

A lot of the behavior that captcha exhibits is in part a function of feature analysis from ML models - features that may seem ridiculous to layman humans but make sense to a neural net plugged into the data.

> That's a massive load of bullshit. Google has a captcha challenge that only humans can solve. That alone is already sufficient to prevent unwanted traffic.

It's not bullshit, it just depends whether your website is being targeted directly or not. We're targeted directly and the robots hitting us are getting the CAPTCHAs solved, presumably with human help.

This sounds like it should be illegal.
I think you are partially wrong. Maybe Google is not doing this intentionally but it also doesn't happen just because traffic is coming out of a tor node. I am using ff with some of the recommended extensions from https://www.privacytools.io/ and I get to fill in traffic signs all the time. And yes I am logged into Google.
I think what OP is talking about is Cloudflare not Google's decision. Google provides the CATCHPA API but Cloudflare decides to flag nearly all Tor traffic and make it go through the CATCHPA.
In the case of Cloudflare specifically, they support Privacy Pass[0], an extension that allows solving one captcha to allow you through to multiple sites without de-anonymizing or reducing the security properties that tor provides.

[0] https://blog.cloudflare.com/cloudflare-supports-privacy-pass...

Cloudflare is a good actor, they offer the PrivacyPass extension that basically generates 30 auth tokens from one CAPTCHA challenge and then uses those until it needs new tokens. Sadly the overwhelming majority of sites doesn't use CAPTCHA through CloudFlare but directly through Google, rendering PrivacyPass moot.
Cloudflare is not a good actor in this, they have shown that they do not care about encryption (allowing non-https backends while showing https to the end user) and embedding trackers in verification pages (the CAPTCHAs on random pages).
Cloudflare is the scum of the internet. They've put a crazy amount of effort towards making wide swathes of the internet unusable for people trying to protect their identity and privacy. I wouldn't trust their implementation of Privacy Pass.
Sigh. We changed this so long ago yet people repeat this over and over again. Do you use the Tor Browser? Please show me a site on Cloudflare which uses CAPTCHA on Tor.
I don't know about TOR, but a couple of years ago we had a site on Cloudflare that had the CAPTCHA come up for visitors from mainland China - where the great firewall blocked the requests to Google. Chinese users were effectively locked out. We contacted Cloudflare about this and got dismissive replies.
I ended up removing a chrome extension that randomises user-agents because of this. It dramatically cut down google captchas.

Another thing that sets it off is virtual machine usage, I can be logged into chrome and gmail on the same residential IP for hours but the moment I try to search google for a problem inside a VM it's a minute of slow loading captchas.

Have moved to bing instead, that sort of wasted productivity is a burden.

This. The reality is, Google (and Cloudflare, and everyone else trying to block scrapers and malicious traffic) use heuristics that boil down to "99% of our traffic behaves like this". If you go out of your way to fall into the 1%, e.g. using Tor, disabling Javascript, randomizing your user-agent, etc., you're going to get CAPTCHAed.
Yeah, blending in seems to work better in many cases. Remember the guy who sent a bomb threat over TOR? The only reason he was caught so quickly was because he's the only person on the organisation's network to have accessed TOR before the incident.
> Google isn't going out of their way to punish you for trying to protect your privacy. They're trying to stop unwanted traffic. By unfortunate happenstance, (...)

This does not agree with my experience. I browse without cookies and severely limited javascript (using umatrix), and I also encounter the myriad of ridiculous inconveniences that the OP was referring to. On the good side, however, the web is much faster and generally less annoying.

> I browse without cookies and severely limited javascript (using umatrix), and I also encounter the myriad of ridiculous inconveniences that the OP was referring to.

Isn't this also something that many bots do (don't run javascript and don't have realistic cookies)? It seems like just another instance of reducing your distance from the "bot" cluster in agent-space.

These are exactly the kinds of behaviors that bots sometimes engage in.
It's often the website provider redirecting users to a captcha based on certain conditions.
As a webmaster I can confirm that I hard block all TOR traffic for this exact reason. 90% of this traffic is malicious robotic junk of some form.

Also, I’m just not interested in the remaining 10% "legit" traffic from people who are aggressively paranoid about their privacy. Almost all of them ended up being dickheads who were using TOR to abuse other members of our community.

To the people who think every website should treat TOR users with respect, please understand that you are intentionally making yourself indistinguishable from the mountain of robotic junk, abuse and human dickheads. It's not my fault that you have chosen to do this, and it's not my job to provide you with tools to prove you're not a dickhead.

To the people voting me down, please understand that I am relaying factual information about my specific experience as webmaster of various large-ish regional websites. If you don't like the facts, voting them down won't change them.

...or maybe voting me down will change the facts.

Yeah, that's totally going to work.

Google seems to do the same even if you check the box while in an incognito window; I doubt the issue is TOR itself, but rather the lack of tracking data that Google has on that particular session.
Have you used Captcha on TOR? It really does feels like they're trying to punish you. They give you about 4 pages of "identify the traffic light", all of which are difficult for humans, then reject and give you another 4 pages. Or that thing where it fades out for about 7 seconds before you click the next image, and then wait another 7 seconds.
Google used to HATE my VPN, couldn't do anything on Google through it without a dozen damn pics to choose from.

My VPN must have gotten white listed (or cracked down on some of their traffic patterns) because that stopped.

Or did they just get better at fingerprinting us?
Nothing would surprise me.... I mostly experienced it on an android phone....
If Google wants to do that, that’s their prerogative. What pisses me off is when a bank or similar “secure” type of service forces me to train Google’s ML models in order to access my stuff. I didn’t agree to provide unpaid labor to Google.
Running uBlock origin seems to trigger the same thing, even on a static IP. It feels an awful lot like punishment to me.
This is a horrible argument. What gave Google the right to be the moral authority of the internet (we, we did)? Even if 99% of exits from tor nodes are malicious, Google should have absolutely no capability to throttle this traffic. Unless you claim most of the traffic in tor are from bots, your argument doesn't make any sense. Captchas serve 2 purposes: slowing down bots, annoying humans. By putting captcha to tor exits, Google not only slows down miniscule amount of bots, but also annoys human traffic (good or bad). It is by no means a "good" thing that Google is capable of this.
I have exactly the same experience without using Tor, living in Germany...

I personally don't care too much about the hassle, but I really don't like the idea that I'm basically playing Artificial "Intelligence"/doing clickworking for the not so community oriented efforts of Google.

> Well in excess of 90% of traffic coming out of TOR is spam, bots, malicious, or some combination!

Do you have any data on this?

An excellent, wise, and cogent question! In fact I do have data. You can find it here: https://blog.cloudflare.com/the-trouble-with-tor/

> On the other hand, anonymity is also something that provides value to online attackers. Based on data across the CloudFlare network, 94% of requests that we see across the Tor network are per se malicious. That doesn’t mean they are visiting controversial content, but instead that they are automated requests designed to harm our customers. A large percentage of the comment spam, vulnerability scanning, ad click fraud, content scraping, and login scanning comes via the Tor network.

The obvious caveats apply, of course. It's completely possible what Cloudflare saw at the time is no longer true and TOR is no longer mostly spam. It's equally fully possible that the traffic Cloudflare sees is wildly unrepresentative of what TOR traffic actually looks like, and it's mostly people worried about their privacy. This is just the data we have at the moment.

A small percentage of bad actors using automaton can produce a lot of traffic. So although it may be true that a large portion of the requests coming from TOR exit nodes is malicious, it would be unwise to conclude that most users of TOR have bad intentions.
True, but from the perspective of an org like CloudFlare, that doesn't matter. They don't know (or care) about the user breakdown coming from Tor; they just know that the vast majority of traffic coming from it is malicious. And since part of the point of Tor is to make it hard to determine who's who, the good traffic gets binned with the bad.
I don't think anyone is concluding that.
I think a lot of people come to exactly that conclusion.
Why would they? Most people cognizant of these terms knows a bot generates more traffic than a human; that’s the point of most bots.
Cloudflare's documented experience aligns closely with mine; I've been limiting or blocking TOR ever since 2008 because over 90% of the traffic was malicious bots, and the majority of the remainder was malicious humans.

And when you have malicious traffic swimming in an anonymous pool, there's no practical alternative but to block all of it.

Isn't cloudflare the org that "Doesnt censor under any circumstances", and then turned around and censored white supremacists? Not that I agree with them (I DONT!), but it was a full 180.

And also, isn't cloudflare also the one to allow booters and stressers to be online behind CF - and they used stolen CC's to boot?

The Tor decisions to screw users over is just the cherry on top. Especially is egregious is when a captcha is demanded on even a simple static page. Seems pretty obvious what's going on here.

Everyone should censor and shun white supremacists. They have no place in modern society. When they shed their noxious views, we can all welcome them back with open arms.
Ok, so you've decided that being white supremacist is bad. I can agree with you on that, but still the question remains: who get's do decide what has a place in modern society? Who decides what "modern society" even is? Today Google might decide to censor white supremacists, tomorrow it can be human rights advocates. I think that allowing any type of censorship, even for such a noble cause as fighing racism is a slippery slope. Especially when done by a private company that is outside of our control (and governments are only marginally better).
You're trying to generalize a useful rule ("shun white supremacists") but it doesn't work in this case. I don't think we need to, either.

We're not robots. We can shun white supremacists and leave everyone else alone. This isn't a slippery slope, it's just good sense (no more white supremacists, hey!). Humankind will get along just fine if we tack on that one extra rule and all follow it.

I think they do exactly that. For example disabling browser fingerprinting in firefox and not being logged into Google causes the majority of sites to display the captcha, especially when using a VPN.
They could use a memory-hard hashing function, like ARGON2 for proof of work, it would make spamming much harder.
Not really, because spam isn't done on the spammer's hardware. Not to mention, an expensive hashing function is precisely something bots can do but humans cannot.

If you're putting constraints on Tor traffic, it's not because of raw throughput. It's because it's extremely poor quality traffic.

I see..the goal of ARGON2 is not to be expensive, but to be hard to parallize. Anyways the other points that you wrote make sense.
You're absolutely right! It could even be integrated meaningfully into browsers to make it easier to work with. Something Cloudflare's Privacy Pass (https://support.cloudflare.com/hc/en-us/articles/11500199265...) could work.
It looks really nice.

It should be default for the TOR browser for sure, if just a few people use it, it decreases the anonimity set.

Nah, it was released back in 2017. I've seen it discussed periodically ever since.

The issue with just doing memory-consuming work client-side is that it only marginally slows down spamming. Spammers tend to use compromised machines they don't own. Unless you can make it prohibitively expensive to calculate something using machines you don't pay for - perhaps not a trivial ask - you wind up needing a different set of tools. This is why Google tends to look at things that will exhibit human variation rather than pure computation.

It's not that your ideas aren't good. I'm sure ARGON2 has a use here! It's that this might not be a problem easily solved by consuming more resources.

Cool, I'll try it out the next time I have a problem with using TOR. You're right that ARGON2 doesn't help if CPUs/RAM are free, it just makes parallelization hard.
Parallelization is easy if you have a botnet of millions of machines owned by others to draw on.