| HN Mirror



	by ribtoks 14 days ago
	Use proof-of-work captchas; many are private by default. Look into Private Captcha or Cap captcha.

4 comments

mootothemax 14 days ago

Speaking from the scraper’s perspective, I like proof of work; a ten-year-old 96-core server will cost a couple of quid to run for a few hours and will grab an absurd number of pages thanks to the access granted by repeatedly solving proofs of work. Small slick codebases too!
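A rough back-of-the-envelope sketch of that claim, in Python. Only the 96 cores come from the comment above; the 3-hour rental, the 2 GBP price, the per-page PoW time, and the function names are all assumptions for illustration, and the model naively assumes one PoW per page and one solve per core at a time.

```python
def pages_per_rental(cores: int, hours: float, pow_seconds_per_page: float) -> int:
    """Pages fetched if every core does nothing but solve one PoW per page."""
    core_seconds = cores * hours * 3600
    return int(core_seconds / pow_seconds_per_page)


def cost_per_page(total_cost: float, pages: int) -> float:
    """Rental cost spread evenly over the pages fetched."""
    return total_cost / pages


if __name__ == "__main__":
    # Assumed figures: 3 hours, 2 GBP total, 1 s or 10 s of CPU per page.
    for pow_seconds in (1.0, 10.0):
        pages = pages_per_rental(cores=96, hours=3, pow_seconds_per_page=pow_seconds)
        print(pow_seconds, pages, cost_per_page(2.0, pages))
```

Under these assumptions, even a 10-second PoW leaves the rental yielding roughly a hundred thousand pages.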

tardedmeme 14 days ago

There's also the Anubis idea where your PoW is persistent until your IP address or session cookie changes, so you get to skip PoW in exchange for making yourself identifiable, which means the PoW can then be ramped up to take a couple of minutes.

I don't use Anubis though. I just make sure my site doesn't take five seconds to render a page, which is what lets bots overload it so easily. It's not actually that hard.
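The persistent-PoW idea described above can be sketched roughly as follows. This is a generic hashcash-style example, not Anubis's actual implementation: the SHA-256 leading-zero-bits puzzle, the HMAC pass token, and every name here (`SECRET`, `solve`, `verify`, `issue_token`, `token_valid`) are assumptions made for illustration.

```python
import hashlib
import hmac
import os

SECRET = os.urandom(32)  # hypothetical server-side signing key


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def _digest(challenge: bytes, nonce: int) -> bytes:
    return hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()


def solve(challenge: bytes, difficulty: int) -> int:
    """Client side: brute-force a nonce; each extra bit doubles the expected work."""
    nonce = 0
    while leading_zero_bits(_digest(challenge, nonce)) < difficulty:
        nonce += 1
    return nonce


def verify(challenge: bytes, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash, so checking is cheap."""
    return leading_zero_bits(_digest(challenge, nonce)) >= difficulty


def issue_token(client_id: str) -> str:
    """After a valid solve, hand out a pass bound to the IP or session id."""
    return hmac.new(SECRET, client_id.encode(), hashlib.sha256).hexdigest()


def token_valid(client_id: str, token: str) -> bool:
    """The pass only works while the client keeps the same identity."""
    return hmac.compare_digest(issue_token(client_id), token)


if __name__ == "__main__":
    challenge = os.urandom(16)
    nonce = solve(challenge, difficulty=16)
    if verify(challenge, nonce, 16):
        print(issue_token("203.0.113.7"))
```

Because the pass is tied to the client identity, changing IP or dropping the cookie means paying for the PoW again, which is the trade-off the comment describes.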

Velocifyer 14 days ago

It would be more profitable to mine bitcoin.

arbol 14 days ago

PoW doesn't stop bots. It's an annoyance at most. A rate limiter and nothing more.

0123456789ABCDE 14 days ago

PoW difficulty can be scaled, e.g. every cookie must do 1s of work, but a 2nd cookie from the same IP might have to do 2s of work

ideally one would pick something a bit more forgiving than a linear function, to avoid over-penalizing users connecting from behind CGNAT
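One possible "more forgiving than linear" curve is logarithmic. A minimal sketch, assuming a hypothetical `work_seconds` function, a 1s base, and a 120s cap (none of which come from the comment itself):

```python
import math


def work_seconds(prior_cookies: int, base: float = 1.0, cap: float = 120.0) -> float:
    """Seconds of PoW to demand, given how many cookies this IP already holds.

    Grows with log2 instead of linearly, so a busy CGNAT address is
    slowed down without being locked out.
    """
    if prior_cookies < 0:
        raise ValueError("prior_cookies must be non-negative")
    return min(cap, base * (1.0 + math.log2(1 + prior_cookies)))


if __name__ == "__main__":
    for n in (0, 1, 3, 1023):
        print(n, work_seconds(n))
```

The first cookie costs 1s and the second 2s, matching the example above, but the 1024th costs 11s rather than the 1024s a linear rule would demand.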

arbol 13 days ago

I think we're talking about 2 different things. PoW is annoying for basic scrapers but it really doesn't affect enterprise-grade bot operations with access to unlimited residential proxies.

0123456789ABCDE 13 days ago

look, if you have the kind of bank to pay $1/GB of web pages to get to whatever we're hosting, maybe just send an email to sales so we can give you a discount on that.

matheusmoreira 14 days ago

Can this be repurposed as some kind of distributed cryptocurrency mining mechanism? Pay websites by mining some monero in order to access them?

phoronixrly 14 days ago

How does proof of work stop bots?

stephantul 14 days ago

Because it destroys the economics of scraping. It’s too expensive with proof of work, or at least not as economically viable

gruez 14 days ago

Depends on what type of scraping you're trying to stop. For the dumb scrapers that would try to scrape every page on a git forge (for which there are a bazillion pages for a modest project, because of how the site works), yeah it might deter them enough to stop. For anything high value (eg. reddit comments or retail prices), 10s of cpu time isn't going to stop them.

pmontra 14 days ago

It will not scare away bots but 10 seconds of wait (CPU or only a sleep) will turn away many real users. "This site is so slow, I'll use something else." A kind of reverse captcha.

Hnrobert42 14 days ago

Maybe the proof of work can run in the background.

btown 14 days ago

Or it can run as part of a checkout wizard's "verifying your browser and processing your payment, don't close your tab" step.

stephantul 14 days ago

Sure, the whole premise is exactly that proof of work reduces the value of scraping, while having negligible impact on users. If the data is so valuable that bot operators are willing to pay 10s of cpu, then other measures are necessary.

Nevertheless, even for these high-value cases, you can still argue that it disincentivizes the business model, since it becomes less efficient.

thayne 14 days ago

If it's high value, there isn't really much you can do that will be completely effective. Traditional captchas can often be beaten by AI, or by "captcha farms" where impoverished people are paid pennies to complete captchas. Fingerprinting can be beaten by using a full browser to make the requests. Basically anything you do is just a matter of making it more expensive for bots to access it.

arbol 14 days ago

Beating fingerprinting and beating traditional captcha is far more expensive than solving PoW. PoW doesn't stop anyone, not even the most novice bot operators.

tobyhinloopen 14 days ago

You can just download all of Reddit from torrent sites

ranger_danger 14 days ago

5W load for 2 seconds is about 0.003Wh, I think we'll be fine
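A quick check of that arithmetic: energy in watt-hours is power in watts times time in seconds, divided by 3600. The function name and the one-million-solve extrapolation are illustrative assumptions.

```python
def energy_wh(power_watts: float, seconds: float) -> float:
    """Energy in watt-hours for a constant load held for a given time."""
    return power_watts * seconds / 3600.0


if __name__ == "__main__":
    single = energy_wh(5, 2)  # 10 joules, roughly 0.0028 Wh
    print(single)
    print(single * 1_000_000 / 1000)  # kWh for a million such solves
```

So one solve costs a few thousandths of a watt-hour, and a million of them come to under 3 kWh.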

arbol 14 days ago

Except it doesn't

ray_v 14 days ago

If it gets too expensive/time-consuming to scrape then it won't happen at scale (as much)?