Hacker News
by xena 460 days ago
Is there support for robots.txt so service operators can opt out of your mass scraping?
3 comments

Not only do they not respect robots.txt, but they publish an entire page[1] in their docs dedicated to circumventing scraping countermeasures.

I pointed their scraper at a URL on my server to test its behavior. It made four separate requests to the same page, three with the UA "undici"[2], and one with the UA "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36" and a different IP. The first three requests were all made within the span of 1 second, and the fourth 27 seconds later.

I emailed their published support address asking for an IP range and UA. They gave me the entire IP range of Google Cloud, and ignored the UA question.
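For anyone wondering why a narrow published range matters: an operator can only filter by source address if the range is specific to the crawler. A minimal Python sketch of that check follows. The CIDR blocks are documentation-reserved placeholders and the names `PUBLISHED_RANGES` and `in_ranges` are made up for illustration; none of this is anything the service published.

```python
import ipaddress

# Placeholder CIDR blocks from the ranges reserved for documentation.
# A real list would have to come from the crawler operator.
PUBLISHED_RANGES = ["192.0.2.0/24", "198.51.100.0/24"]


def in_ranges(ip: str, cidrs: list[str]) -> bool:
    """Return True if `ip` falls inside any of the CIDR blocks in `cidrs`."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)
```

A range as broad as a whole cloud provider defeats this kind of check, since matching on it would also catch every unrelated tenant of that provider.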

This goes well beyond the "it's up to our users to implement responsible scraping practices" implication from the developer's other comment[3]. Instead, their service behaves maliciously by default, and they have implemented and documented switches that users can toggle for additional malicious scraping behavior. As far as I can tell, it is not even possible to implement a robots.txt-respecting scraper on top of this, because I couldn't find any mechanism for users to set a specific UA string.
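To make concrete what "respecting robots.txt" requires of a client, here is a minimal sketch using Python's standard `urllib.robotparser`. The UA token `examplebot` and the sample rules are hypothetical. The point is that the check is keyed on a stable, identifiable user agent, which is exactly what I could not find a way to set.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical UA token; a real crawler would publish its own.
USER_AGENT = "examplebot"

# Sample rules for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: examplebot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, url: str, user_agent: str = USER_AGENT) -> bool:
    """Ask whether `user_agent` is allowed to request `url`."""
    return parser.can_fetch(user_agent, url)
```

With these sample rules, `examplebot` is shut out entirely, while any other agent is only kept out of `/private/`. A client that swaps to a generic browser UA sidesteps the first rule, which is why a fixed UA matters.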

[1]: https://docs.hyperbrowser.ai/sessions/advanced-privacy-and-a... (archived: https://web.archive.org/web/20250322045952/https://docs.hype..., http://archive.today/2025.03.22-050029/https://docs.hyperbro...)

[2]: https://github.com/nodejs/undici

[3]: https://news.ycombinator.com/item?id=43442116

Nice work.

I feel like there's a lot to unpack here, and still much to discuss in the broader context.

There are a few things that can be excused or at least reasonably argued. Multiple IPs makes some sense since requests are being made through random proxies, and I don't think on its own demonstrates intention of bad behavior. If it were, I think you would've seen all four requests coming from different IPs and with different user agents, all posing as legitimate user browser sessions.

But the rest is inexcusable. Providing documentation on how to circumvent countermeasures without even acknowledging, or presenting a viewpoint on, the concern many have with the potential malicious use of these tools. Taking the time to respond to your inquiry about the IP range and UA but giving an answer that is somewhere between intentionally incomplete and intentionally misleading. (Is the actual IP range even within the large Google IP range given the source of the proxies as mentioned elsewhere in this thread?) Just very poor decisions on how to handle the extremely predictable resistance they were very obviously going to encounter.

> Multiple IPs makes some sense since requests are being made through random proxies, and I don't think on its own demonstrates intention of bad behavior.

Agreed, just using multiple IPs isn't malicious on its own. I thought it was notable in the context of issuing another request with a generic browser UA. It's possible that the IP change was a deliberate strategy to avoid detection (like changing the UA likely is), but also possible that it was just a side effect of their infrastructure design, or a combination of the two.

> without even acknowledging, or presenting a viewpoint on, the concern many have with the potential malicious use of these tools.

So, amusingly, they seem to have added an "ethical scraping" page[1] to their docs in between me looking at this a few hours ago and now. (You can see that this page is missing from the sidebar in my archive link from earlier.) I particularly enjoy the parts where they say "follow robots.txt rules" and "limit RPS on one site", because as far as I can tell it is actually not possible to do either of these things as a user of this service. There is no mechanism (that I could find) to set an identifiable user agent on the scraper client, nor a mechanism to control the delay between crawling different pages. It's not impossible that they have implemented a reasonable rate limit, with proper backoff when it appears the target is under load, but I wouldn't bet on it.
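For reference, "a reasonable rate limit, with proper backoff" can be quite small. The sketch below is one possible shape, not a description of what the service does: a per-host delay that doubles when the target answers 429 or 5xx and relaxes again on success. The class name, the defaults, and the choice of status codes are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class PoliteScheduler:
    """Tracks, per host, the earliest time the next request may be sent."""

    min_delay: float = 1.0    # seconds between requests to a healthy host
    max_delay: float = 300.0  # upper bound on the backoff delay
    _delay: dict = field(default_factory=dict)
    _next_ok: dict = field(default_factory=dict)

    def wait_time(self, host: str, now: float) -> float:
        """Seconds the caller should still wait before contacting `host`."""
        return max(0.0, self._next_ok.get(host, 0.0) - now)

    def record(self, host: str, status: int, now: float) -> float:
        """Update the delay for `host` from a response status; return the new delay."""
        current = self._delay.get(host, self.min_delay)
        if status == 429 or status >= 500:
            # The target looks overloaded: double the delay, up to the cap.
            current = min(current * 2, self.max_delay)
        else:
            # Healthy response: halve the delay, but never below the floor.
            current = max(self.min_delay, current / 2)
        self._delay[host] = current
        self._next_ok[host] = now + current
        return current
```

A caller would sleep for `wait_time(host, time.monotonic())` before each request and pass the response status to `record` afterwards. Honoring a `Retry-After` header would be a natural addition that is left out here.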

> Is the actual IP range even within the large Google IP range given the source of the proxies as mentioned elsewhere in this thread?

Good question! I am not able to test this, because they don't expose the proxies without paying them money, which I do not intend to do. My guess would be "no".

> still much to discuss in the broader context.

Yeah. The broader context is that most LLM scrapers are a plague, and I cannot wait for this bubble to pop. Until recently, we were getting upwards of 20k requests per day from individual LLM scrapers on a GitLab instance that I admin. The combination of malice and staggering incompetence with which these are operated is incredible. I have observed fun tactics like "switch to a generic UA and increase the crawling rate after being added to robots.txt" from the same scraper that isn't smart enough to realize that it doesn't need to crawl the same commit hash multiple times per hour. The bit that tells you not to get stuck crawling the CI pipeline results forever is there to protect you, silly.

Things are reportedly much worse[2] for admins of larger services. I saw this referred to as a "DDOS of the entire internet" a while ago, which is pretty accurate.

What we ended up doing is setting up an infinite maze of Markov chain nonsense text that we serve to LLM scrapers at a few bytes per second. All they have to do to avoid it is respect robots.txt. I recommend this! It's fun and effective, and if we're lucky, it might cause harm to some people and systems that deserve it.
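A rough sketch of the two ingredients, for the curious. This is not our actual setup, just the general idea in Python: a word-level Markov chain built from some corpus, and a generator that hands the result out a few bytes at a time. The function names, chunk size, and delay are illustrative assumptions, and the maze part (pages that link to more generated pages) is left out.

```python
import random
import time
from collections import defaultdict


def build_chain(corpus: str) -> dict:
    """Map each word to the list of words that follow it in the corpus."""
    words = corpus.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)


def babble(chain: dict, n_words: int, rng: random.Random) -> list:
    """Random-walk the chain for `n_words` steps, restarting at dead ends."""
    starts = sorted(chain)
    word = rng.choice(starts)
    out = [word]
    for _ in range(n_words - 1):
        followers = chain.get(word)
        word = rng.choice(followers) if followers else rng.choice(starts)
        out.append(word)
    return out


def drip(text: str, chunk_size: int = 4, delay: float = 1.0):
    """Yield `text` as small byte chunks, pausing `delay` seconds after each."""
    data = text.encode()
    for i in range(0, len(data), chunk_size):
        yield data[i:i + chunk_size]
        time.sleep(delay)
```

Hooked up to a streaming HTTP response, `drip(" ".join(babble(chain, 500, random.Random())))` keeps a connection occupied for a long time at negligible cost to the server.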

[1]: https://web.archive.org/web/20250322072210/https://docs.hype...

[2]: https://status.sr.ht/issues/2025-03-17-git.sr.ht-llms/

> The broader context is that most LLM scrapers are a plague, and I cannot wait for this bubble to pop.

My thing is, there are legitimate uses for automated browsing, uses that could be extremely useful and yet nondamaging to (or even supported by!) site operators. But we'll never get to have them if the tools/methods to implement them are the same as the ones used by people inadvertently DDOSing the site they're trying to inhale the entire contents of. For them to not get lumped together, purveyors of the tools cannot remain "neutral" or hide implementation details or condone, whether explicitly or implicitly, bad behavior. We've seen this happen on the web before, and we're already seeing desperate organizations implement nuclear-option LLM scraper blockers that also take out things like RSS readers. Anyways... I may just have to write something about this...

> So, amusingly, they seem to have added an "ethical scraping" page[1] to their docs in between me looking at this a few hours ago and now.

Congrats, you made an impact :)

Would you like to explain how directing your user agent to use the internet just as you would in order to complete a task or solve a problem is "mass scraping"?

> to use the internet just as you would

Your premise is flawed, or at the very least, far from guaranteed. One could use tools like this to browse the internet only as they would otherwise. One could also use it for mass scraping, and many do. If you've looked at HN on previous days this week, there's been a front-page story nearly every day about problems resulting from exactly that.

The parent comment was perhaps a bit too snarky with the assumption that this could/would only be used for malicious behavior on a large scale, but your assumption that it wouldn't be is not any better, and also runs contrary to what people have been experiencing and discussing in this arena in recent days.

Intent.

Care to elaborate? That is not a substantial argument.

No, we don't enforce any robots.txt restrictions ourselves. We also don't do any scraping ourselves. We provide browser infrastructure that operates like any normal browser would - what users choose to do with it is up to them. We're building tools that give AI agents the same web access capabilities that humans have, and we don't think it's our place to impose any additional limitations.

It is 100% your responsibility what your servers do to other people's servers in this context, and wanton negligence is not an excuse that will stop your servers from being evicted by hosting companies.

You make the tools, what people do with them isn't up to you. I can tolerate some form of that opinion/argument on some level, but it is at the very least short-sighted on your part to not have been better-equipped for how to respond to concerns people have about potential misuse.

If what has been said elsewhere in this thread is true about providing documentation on how to circumvent attempts to detect/block your service and your resistance to providing helpful information such as IP ranges used and how user agents are set, then you have strayed far from being neutral and hands-off.

"it's not our place" is not actual neutrality, it's performative or complicit neutrality. Actual neutrality would be perhaps not providing ways to counter your service, but also not documenting how to circumvent people's attempts to do so. And if this is what your POV is, fine! You are far from alone--given the state of the automated browsing/scraping ecosystem right now, plenty of people feel this way. Be honest about it! Don't deflect questions. Don't give misleading answers/information. That's what carries this into sketchy territory.

Do you publish an IP range?