by marginalia_nu 428 days ago
You basically need proof-of-work to make this work. Idling a connection is not computationally expensive, so it is not a deterrent.

It's a shitty solution to an even shittier reality.


Main author of Anubis here:

Basically what they said. This is a hack, and it's specifically designed to exploit the infrastructure behind industrial-scale scraping. They usually have a different IP address do the scraping for each page load _but share the cookies between them_. This means that if they use headless Chrome, they have to do the proof-of-work check every time, which scales poorly given the rates I know the headless Chrome vendors charge for compute time per page.
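
For readers unfamiliar with the idea, here is a minimal hashcash-style sketch of what a proof-of-work check can look like. It is not Anubis's actual code: the function names `solve` and `verify`, the challenge string, and the "leading zero hex digits" difficulty rule are all assumptions chosen for illustration.

```python
import hashlib
import itertools


def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force the smallest nonce whose SHA-256 digest of
    challenge + nonce starts with `difficulty` zero hex digits."""
    prefix = "0" * difficulty
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash is enough to check the client's work."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)


# Hypothetical challenge value; a real server would issue a fresh one.
nonce = solve("example-challenge", 3)
print(nonce, verify("example-challenge", nonce, 3))
```

The asymmetry is the point: each extra hex digit of difficulty multiplies the client's expected work by 16, while the server still verifies with one hash. A human pays that once per cookie; a scraper that has to redo it on every page load pays it every time.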

Is there any particular date/time you'll introduce a no-JS solution?

And are you going to support older browsers? I tested Anubis with https://www.browserling.com with its (I think) standard configuration at https://git.xeserv.us/xe/anubis-test/src/branch/main/README.... and apparently it doesn't work with Firefox versions before 74 and Chromium versions before 80.

I wonder if it works with something like Pale Moon.

It will be sooner if I can get paid enough to be able to quit my day job.

I used to have an ISP that would load-balance your connection between different providers, which meant that pretty much every single request would use a different IP. I know it's not that common, but it would mean real users would find pages using Anubis unusable.

If this behavior of Anubis becomes well known and scrapers start handling Anubis cookies specifically to avoid the pathological PoW checks, do you think Anubis will need a significant rework? If that's indeed true, this hack won't last much longer, and I have no further ideas for avoiding user-visible annoyances.

Well, if they rework things so that requests all originate from the same IP address or a small set of addresses, then regular IP-based rate limits should work fine right?

The point is just to stop what is effectively a DDoS because of shitty web crawlers, not to stop the crawling entirely.

> Well, if [...], then regular IP-based rate limits should work fine right?

I'm not sure. IP-based rate limits have a well-known issue with shared public IPs, for example. Technically they are also more resource-intensive than cryptographic approaches (but I don't think that's a big issue in IPv4).

> then regular IP-based rate limits should work fine right?

These are also harmful to human users, who are often behind CGNAT and may be sharing a pool of IPs with many thousands of other ISP subscribers.
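
To make the shared-IP problem concrete, here is a rough sketch of a per-IP token bucket. The class name `PerIPRateLimiter`, the rate and burst values, and the address `203.0.113.7` (a documentation-only address standing in for a CGNAT public IP) are all made up for illustration, and the clock is passed in explicitly to keep the example deterministic.

```python
from collections import defaultdict


class PerIPRateLimiter:
    """Token bucket keyed by client IP (illustrative, not production code)."""

    def __init__(self, rate: float, burst: int):
        self.rate = rate      # tokens refilled per second
        self.burst = burst    # maximum tokens a bucket can hold
        self.tokens = defaultdict(lambda: float(burst))
        self.last = {}        # last time each IP was seen

    def allow(self, ip: str, now: float) -> bool:
        # Refill based on time elapsed since this IP's previous request.
        elapsed = now - self.last.get(ip, now)
        self.last[ip] = now
        self.tokens[ip] = min(self.burst, self.tokens[ip] + elapsed * self.rate)
        if self.tokens[ip] >= 1:
            self.tokens[ip] -= 1
            return True
        return False


limiter = PerIPRateLimiter(rate=1.0, burst=3)
# Five different subscribers behind the same public address, same instant:
print([limiter.allow("203.0.113.7", now=0.0) for _ in range(5)])
```

The limiter cannot tell subscribers apart, so once the shared bucket is empty, the fourth and fifth people are refused even though each of them made only one request.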

> Weigh the soul of incoming HTTP requests using proof-of-work to stop AI crawlers

Based on the comments here, it seems like many people are struggling with the concept.

Would calling Anubis a "client-side rate limiter" be accurate (enough)?

Probably not