| HN Mirror


	by adamtaylor_13 17 days ago
	So if you need to prevent bot abuse, but also don't want an ugly captcha every time someone goes to sign up, is there a better option?

3 comments

ribtoks 17 days ago

Use proof-of-work captchas; many are private by default. Look into Private Captcha or Cap captcha.

mootothemax 16 days ago

Speaking from the scraper’s perspective, I like proof of work; a ten-year-old 96-core server will cost a couple of quid to run for a few hours and will grab an absurd number of pages thanks to the access granted by repeatedly solving proofs of work. Small slick codebases too!

tardedmeme 16 days ago

There's also the Anubis idea where your PoW is persistent until your IP address or session cookie changes, so you get to skip PoW in exchange for making yourself identifiable, which means the PoW can then be ramped up to take a couple of minutes.

I don't use Anubis though. I just make sure my site doesn't take five seconds to render a page, so bots can't easily overload it. It's not actually that hard.
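
For anyone wondering what "persistent until your IP address or session cookie changes" could look like, here is a minimal sketch. It is not Anubis's actual code: `SECRET`, `issue_pass` and `check_pass` are names made up for illustration, and it assumes the server hands out an HMAC-signed pass bound to the client IP once a proof of work has been accepted.

```python
import hashlib
import hmac
import time
from typing import Optional

SECRET = b"server-side-secret"  # hypothetical key; a real deployment would load this from config


def _sign(client_ip: str, expires: int) -> str:
    """HMAC over the client IP and expiry, so neither can be changed by the client."""
    payload = f"{client_ip}|{expires}".encode()
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()


def issue_pass(client_ip: str, ttl_seconds: int, now: Optional[float] = None) -> str:
    """Issue a pass after a successful proof of work; it is only valid for this IP."""
    now = time.time() if now is None else now
    expires = int(now) + ttl_seconds
    return f"{expires}.{_sign(client_ip, expires)}"


def check_pass(token: str, client_ip: str, now: Optional[float] = None) -> bool:
    """Accept the pass only if it is unexpired and was signed for the same IP."""
    now = time.time() if now is None else now
    try:
        expires_text, signature = token.split(".", 1)
        expires = int(expires_text)
    except ValueError:
        return False  # malformed token
    expected = _sign(client_ip, expires)
    return hmac.compare_digest(signature.encode(), expected.encode()) and now < expires
```

The trade-off from the comment shows up in `check_pass`: rotating to a new IP invalidates the pass, so the client has to pay for the proof of work again.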

Velocifyer 16 days ago

It would be more profitable to mine bitcoin.

arbol 16 days ago

PoW doesn't stop bots. It's an annoyance at most: a rate limiter and nothing more.

0123456789ABCDE 16 days ago

PoW difficulty can be scaled, e.g. every cookie costs 1s of work, but a 2nd cookie from the same IP might cost 2s of work

ideally one would pick something a bit more forgiving than a linear function, to avoid penalizing users connecting from behind CGNAT too much
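
A rough sketch of the two schedules, just to show the difference. The 1s base cost comes from the example above; the logarithmic curve is only one possible "more forgiving" choice, and the function names are made up for illustration.

```python
import math

BASE_SECONDS = 1.0  # cost of the first cookie for an IP, as in the example above


def linear_work(cookies_issued: int) -> float:
    """Linear schedule: 1s, 2s, 3s, ... as more cookies go to the same IP."""
    return BASE_SECONDS * (cookies_issued + 1)


def log_work(cookies_issued: int) -> float:
    """Gentler schedule: the cost grows with log2 of the cookies already issued."""
    return BASE_SECONDS * (1.0 + math.log2(cookies_issued + 1))


# cookies_issued counts cookies previously handed to this IP, so 0 means "first cookie".
for issued in (0, 1, 9, 99):
    print(issued, linear_work(issued), round(log_work(issued), 2))
```

Both schedules agree on the first two cookies (1s, then 2s), but the 100th cookie from a shared CGNAT address costs 100s under the linear schedule and under 8s under the logarithmic one.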

arbol 16 days ago

I think we're talking about 2 different things. PoW is annoying for basic scrapers but it really doesn't affect enterprise grade bot operations with access to unlimited residential proxies.

0123456789ABCDE 15 days ago

look, if you have the kind of bank to pay $1/GB of web pages to get to whatever we're hosting, maybe just send an email to sales so we can give you a discount on that.

matheusmoreira 16 days ago

Can this be repurposed as some kind of distributed cryptocurrency mining mechanism? Pay websites by mining some monero in order to access them?

phoronixrly 17 days ago

How does proof of work stop bots?

stephantul 16 days ago

Because it destroys the economics of scraping. It’s too expensive with proof of work, or at least not as economically viable.

gruez 16 days ago

Depends on what type of scraping you're trying to stop. For the dumb scrapers that would try to scrape every page on a git forge (for which there are a bazillion pages for a modest project, because of how the site works), yeah it might deter them enough to stop. For anything high value (eg. reddit comments or retail prices), 10s of cpu time isn't going to stop them.

pmontra 16 days ago

It will not scare away bots but 10 seconds of wait (CPU or only a sleep) will turn away many real users. "This site is so slow, I'll use something else." A kind of reverse captcha.

Hnrobert42 16 days ago

Maybe the proof of work can run in the background.

stephantul 16 days ago

Sure, the whole premise is exactly that proof of work reduces the value of scraping, while having negligible impact on users. If the data is so valuable that bot operators are willing to pay 10s of cpu, then other measures are necessary.

Nevertheless, even for these high-value cases, you can still argue that it disincentivizes the business model, because scraping becomes less efficient.

thayne 16 days ago

If it's high value, there isn't really much you can do that will be completely effective. Traditional captchas can often be beaten by AI, or by "captcha farms" where impoverished people are paid pennies to complete captchas. Fingerprinting can be beaten by using a full browser to make the requests. Basically anything you do is just a matter of making it more expensive for bots to access it.

arbol 16 days ago

Beating fingerprinting and beating traditional captchas is far more expensive than solving PoW. PoW doesn't stop anyone, not even the most novice bot operators.

tobyhinloopen 16 days ago

You can just download all of Reddit from torrent sites

ranger_danger 16 days ago

5W load for 2 seconds is about 0.003Wh, I think we'll be fine

Except it doesn't

If it gets too expensive/time-consuming to scrape then it won't happen at scale (as much)?

keynha 16 days ago

Behavioral signals are the usual answer: risk-scored, invisible challenges; proof-of-work (cost without identity, though it taxes mobile); and signup-velocity/rate limits that stop cheap abuse before any challenge fires. The reason fingerprinting wins anyway is that it requires less operator effort, not that it is the only thing that works.
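
As a rough illustration of the signup-velocity part only, here is a sliding-window limiter sketch. The class name, the thresholds and the choice of key (an IP, a subnet, an email domain) are all assumptions for the example, not something taken from a particular product.

```python
from collections import defaultdict, deque


class SignupLimiter:
    """Allow at most max_signups per key within a sliding window of window_seconds."""

    def __init__(self, max_signups: int, window_seconds: float) -> None:
        self.max_signups = max_signups
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # key -> timestamps of accepted signups

    def allow(self, key: str, now: float) -> bool:
        events = self._events[key]
        # Forget signups that have fallen out of the window.
        while events and now - events[0] >= self.window_seconds:
            events.popleft()
        if len(events) >= self.max_signups:
            return False  # too many recent signups: reject or escalate to a challenge
        events.append(now)
        return True


limiter = SignupLimiter(max_signups=2, window_seconds=60)
print(limiter.allow("203.0.113.7", now=0.0))  # first signup from this key
print(limiter.allow("203.0.113.7", now=1.0))  # second signup, still within the limit
print(limiter.allow("203.0.113.7", now=2.0))  # third signup inside the window
```

A rejected signup does not have to be a hard block; it can simply be the point where a challenge fires.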

arbol 16 days ago

Behavioural analysis requires interaction. Fingerprinting is instantaneous, and Cloudflare runs it on page load for lots of sites.

ImPostingOnHN 17 days ago

The tool "Anubis" uses proof of work instead

BetterThanSober 16 days ago

With a tuned cool down period this isn't a problem, especially if you frequent the sites. OpenWRT uses Anubis and usually when I need to peruse their site I'm on a very low-end device. I prefer waiting much more over finding Waldos

But in principle I agree that there's no good answer to this. Scraping _is_ useful, and I bet most of us here have scraped something. It is AI companies and their use of human-made material for training, without consent or anything in return, that led us to this (I know botting has existed on forums for as long as forums have been a thing, but it is easily solved by human moderators and keyword filters).

timpera 17 days ago

Anubis often takes more than 60 seconds to complete on low-end devices (especially old smartphones). It seems like there's no good solution.

QuantumNomad_ 16 days ago

But after you’ve completed the Anubis PoW challenge for a site, it remains valid for some amount of time.

So it’s not quite as horrible as it sounds.

I have setting up Anubis for my own sites on my todo list. And I wish more people did it too. I don’t really mind waiting a little bit extra every now and then before the page loads. What I do mind is ReCaptcha asking me to click all the pictures with buses in them etc. And especially when I have to do it several times over before it’s happy. I’d rather wait a minute for a page to load than to ever solve a ReCaptcha again, if given the choice.

mattstir 16 days ago

> So it’s not quite as horrible as it sounds.

I don't know about you, but if a random webpage takes 60+ seconds to load, I just close it and choose to never interact with that site again (unless it's my bank, which is a real and annoying occurrence).

ImPostingOnHN 17 days ago

There's not an easy, perfect solution, for sure. Newer phones get faster, but spammer compute gets cheaper.

Some sort of decentralized trust web seems like another option, though less viable.

WesolyKubeczek 16 days ago

One of the unexpected outcomes of the AI-induced hardware shortage may be that compute won’t be getting cheaper and may in fact get more expensive…

dangus 17 days ago

That must be really low end then. I’ve never seen it complete in a timeframe that was slower than “I can’t even read the page before it redirects”

titularcomment 16 days ago

My guess is it's an implementation error, not a hardware limitation. I have two 10-year-old devices and one passes instantaneously while the other halts for a good half minute every time.

toastal 16 days ago

It also requires JavaScript. I like to have JS off by default since running code on my machine is a privilege, one that I opt into, not the site owner’s choice. This is frustrating since these blockers don’t let me know if the site is trustworthy first before needing to solve a Sudoku for Cloudflare or calculating useless hashes for Anubis.

phoronixrly 17 days ago

How does Anubis stop bots?

redwall_hp 16 days ago

Anubis is designed to stop a certain class of badly behaved bots. It intentionally doesn't run if a bot identifies itself with a UA, such as Googlebot, because then you can rate limit it or block by UA and with other tools.

Anubis is active when a user agent looks like a web browser (e.g. contains the "Mozilla" substring every major browser uses). The reverse proxy serves an interstitial page that does a proof-of-work check, validated server side, setting a cookie if it passes.

This means a legitimate user won't constantly get the proof of work check, because they already passed it. But AI bots rotating through tons of residential IPs to scrape your forum or git forge or whatever will be slowed down.

Overall, I like the idea. It's unobtrusive, privacy preserving, and seems to be working out well for a lot of sites.
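
To make the flow concrete, here is a minimal hash-based proof-of-work sketch. It is an illustration of the general technique under stated assumptions, not Anubis's real code or configuration: the function names, the `challenge:nonce` format, the use of SHA-256 with leading zero bits, and the bare "Mozilla" check are all assumptions for the example.

```python
import hashlib
import secrets


def should_challenge(user_agent: str) -> bool:
    """Only browser-like user agents get the interstitial (simplified check)."""
    return "Mozilla" in user_agent


def make_challenge() -> str:
    """Server side: a random string the client has to work on."""
    return secrets.token_hex(16)


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Server side: a single hash is enough to check the client's answer."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty_bits


def solve(challenge: str, difficulty_bits: int) -> int:
    """Client side: brute-force nonces until the hash has enough leading zero bits."""
    nonce = 0
    while not verify(challenge, nonce, difficulty_bits):
        nonce += 1
    return nonce


challenge = make_challenge()
nonce = solve(challenge, difficulty_bits=12)  # about 4096 hashes on average
print(verify(challenge, nonce, difficulty_bits=12))
```

The asymmetry is the point: solving takes about 2^difficulty_bits hashes on average, while checking takes one. After a successful `verify`, the server would set the cookie that lets the visitor skip the check on later requests.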

basilikum 16 days ago

The real answer is that it makes sites behave differently, requiring the bots to make slight adjustments.

And there are just not enough sites using Anubis for the people and companies running the bots to care to do that.

If you do care, bypassing Anubis is trivial.

arbol 16 days ago

It doesn't. It slows them down. To stop bots you need to employ the full suite of tools: fingerprinting, IP reputation, behavioural analysis. Anubis will slow down your basic scrapers that try to crawl the entire web, but it is useless against actual bots.

xena 16 days ago

Bots don't execute JavaScript or follow complicated redirects.

ranger_danger 16 days ago

They have been doing it for years: https://roundproxies.com/blog/bypass-bot-detection/

ExpertAdvisor01 16 days ago

That's not true. A lot of bots are just headless Chrome instances.

account42 16 days ago

This level of (mis)understanding perfectly explains that crapware.

pwg 16 days ago

> Bots don't [currently] execute JavaScript or follow complicated redirects.

They don't now, but enough "high value to the bots" pages turning on JS or complicated redirects will simply result in the bot authors adding JS execution or redirect following so they can continue "botting" the sites they want to scrape.

It's a hole with no bottom. Each one-up on the anti-bot side will eventually be handled on the bot side.
