| HN Mirror


by olivia-l 460 days ago

Not only do they not respect robots.txt, but they publish an entire page[1] in their docs dedicated to circumventing scraping countermeasures.

I pointed their scraper at a URL on my server to test its behavior. It made four separate requests to the same page: three with the UA "undici"[2], and one with the UA "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36" and a different IP. The first three requests were all made within the span of 1 second, and the fourth 27 seconds later.

I emailed their published support address asking for an IP range and UA. They gave me the entire IP range of Google Cloud and ignored the UA question.

This goes well beyond the "it's up to our users to implement responsible scraping practices" implication from the developer's other comment[3]. Instead, their service behaves maliciously by default, and they have implemented and documented switches that users can toggle for additional malicious scraping behavior. As far as I can tell, it is not even possible to implement a robots.txt-respecting scraper on top of this, because I couldn't find any mechanism for users to set a specific UA string.
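To illustrate why a settable UA matters here, below is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt body, the bot name `examplebot`, and the `example.com` URLs are all made up for illustration; nothing in it reflects how this particular service works. It only shows that a rule addressed to a named bot applies when the client identifies itself under that name, while a generic browser UA falls through to the `*` rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named bot entirely,
# and keep everyone else out of /private/ only.
ROBOTS_TXT = """\
User-agent: examplebot
Disallow: /

User-agent: *
Disallow: /private/
"""

# The generic browser UA quoted in the comment above.
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36"
)


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/page"
    # The identifiable bot is covered by its own "Disallow: /" rule.
    print(allowed(ROBOTS_TXT, "ExampleBot/1.0", page))
    # The same client behind a browser UA only matches the "*" rules.
    print(allowed(ROBOTS_TXT, BROWSER_UA, page))
```

A site operator can only write the first rule if the scraper sends a stable, identifiable UA in the first place.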

[1]: https://docs.hyperbrowser.ai/sessions/advanced-privacy-and-a... (archived: https://web.archive.org/web/20250322045952/https://docs.hype..., http://archive.today/2025.03.22-050029/https://docs.hyperbro...)

[2]: https://github.com/nodejs/undici

[3]: https://news.ycombinator.com/item?id=43442116


dkh 460 days ago

Nice work.

I feel like there's a lot to unpack here, and still much to discuss in the broader context.

There are a few things that can be excused or at least reasonably argued. Multiple IPs makes some sense since requests are being made through random proxies, and I don't think on its own demonstrates intention of bad behavior. If it were, I think you would've seen all four requests coming from different IPs and with different user agents, all posing as legitimate user browser sessions.

But the rest is inexcusable. Providing documentation on how to circumvent countermeasures without even acknowledging, or presenting a viewpoint on, the concern many have with the potential malicious use of these tools. Taking the time to respond to your inquiry about the IP range and UA but giving an answer that is somewhere between intentionally incomplete and intentionally misleading. (Is the actual IP range even within the large Google IP range given the source of the proxies as mentioned elsewhere in this thread?) Just very poor decisions on how to handle the extremely predictable resistance they were very obviously going to encounter.


olivia-l 460 days ago

> Multiple IPs makes some sense since requests are being made through random proxies, and I don't think on its own demonstrates intention of bad behavior.

Agreed, just using multiple IPs isn't malicious on its own. I thought it was notable in the context of issuing another request with a generic browser UA. It's possible that the IP change was a deliberate strategy to avoid detection (like changing the UA likely is), but also possible that it was just a side effect of their infrastructure design, or a combination of the two.

> without even acknowledging, or presenting a viewpoint on, the concern many have with the potential malicious use of these tools.

So, amusingly, they seem to have added an "ethical scraping" page[1] to their docs in between me looking at this a few hours ago and now. (You can see that this page is missing from the sidebar in my archive link from earlier.) I particularly enjoy the parts where they say "follow robots.txt rules" and "limit RPS on one site", because as far as I can tell it is actually not possible to do either of these things as a user of this service. There is no mechanism (that I could find) to set an identifiable user agent on the scraper client, nor a mechanism to control the delay between crawling different pages. It's not impossible that they have implemented a reasonable rate limit, with proper backoff when it appears the target is under load, but I wouldn't bet on it.
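For reference, "a reasonable rate limit, with proper backoff" could look something like the sketch below. It is only an illustration: the function name `next_delay`, the 1 second base delay, and the 300 second cap are assumptions picked for the example, not anything taken from the service or its docs. The idea is that the per-host delay doubles whenever the target answers 429 or 503, honors a `Retry-After` value if one was sent, and eases back toward the base delay once responses look healthy again.

```python
from typing import Optional


def next_delay(
    current: float,
    status: int,
    retry_after: Optional[float] = None,
    base: float = 1.0,   # assumed minimum seconds between requests to one host
    cap: float = 300.0,  # assumed upper bound on the delay
) -> float:
    """Return the delay (in seconds) to wait before the next request to a host."""
    if status in (429, 503):
        # The target looks overloaded or is rate limiting us: double the delay.
        backed_off = min(cap, max(base, current) * 2)
        if retry_after is not None:
            # Never retry sooner than the server asked for.
            return max(backed_off, retry_after)
        return backed_off
    # Healthy response: halve the delay, but never go below the base rate.
    return max(base, current / 2)


if __name__ == "__main__":
    delay = 1.0
    for status in (200, 503, 503, 429, 200, 200):
        delay = next_delay(delay, status)
        print(status, delay)
```

A crawler would sleep for the returned delay between requests to the same host, which is exactly the knob that does not seem to be exposed to users here.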

> Is the actual IP range even within the large Google IP range given the source of the proxies as mentioned elsewhere in this thread?

Good question! I am not able to test this, because they don't expose the proxies without paying them money, which I do not intend to do. My guess would be "no".

> still much to discuss in the broader context.

Yeah. The broader context is that most LLM scrapers are a plague, and I cannot wait for this bubble to pop. Until recently, we were getting upwards of 20k requests per day from individual LLM scrapers on a GitLab instance that I admin. The combination of malice and staggering incompetence with which these are operated is incredible. I have observed fun tactics like "switch to a generic UA and increase the crawling rate after being added to robots.txt" from the same scraper that isn't smart enough to realize that it doesn't need to crawl the same commit hash multiple times per hour. The bit that tells you not to get stuck crawling the CI pipeline results forever is there to protect you, silly.

Things are reportedly much worse[2] for admins of larger services. I saw this referred to as a "DDoS of the entire internet" a while ago, which is pretty accurate.

What we ended up doing is setting up an infinite maze of Markov chain nonsense text that we serve to LLM scrapers at a few bytes per second. All they have to do to avoid it is respect robots.txt. I recommend this! It's fun and effective, and if we're lucky, it might cause harm to some people and systems that deserve it.
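A rough sketch of the general idea, for anyone curious. This is not the actual setup described above; the function names (`build_chain`, `babble`, `drip`), the word-level chain, and the one-byte-at-a-time throttle are all assumptions made for illustration. It builds a word-level Markov chain from some seed text, generates nonsense from it, and yields the result one byte at a time with a sleep in between.

```python
import random
import time
from collections import defaultdict
from typing import Callable, Dict, Iterator, List


def build_chain(corpus: str) -> Dict[str, List[str]]:
    """Map each word to the list of words that follow it in the corpus."""
    words = corpus.split()
    chain: Dict[str, List[str]] = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)


def babble(chain: Dict[str, List[str]], rng: random.Random, count: int) -> Iterator[str]:
    """Yield `count` words by walking the chain at random."""
    starts = sorted(chain)
    word = rng.choice(starts)
    for _ in range(count):
        yield word
        followers = chain.get(word)
        # Dead end (word never had a successor): restart from a random key.
        word = rng.choice(followers) if followers else rng.choice(starts)


def drip(
    text: str,
    bytes_per_second: float,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[bytes]:
    """Yield the text one byte at a time, pausing between bytes."""
    data = text.encode("utf-8")
    for i in range(len(data)):
        yield data[i:i + 1]
        sleep(1.0 / bytes_per_second)


if __name__ == "__main__":
    seed = "the scraper crawls the page and the page links to the maze"
    chain = build_chain(seed)
    nonsense = " ".join(babble(chain, random.Random(), 20))
    for chunk in drip(nonsense, bytes_per_second=4):
        print(chunk.decode("utf-8", errors="replace"), end="", flush=True)
    print()
```

In a real deployment the `drip` generator would feed a streaming HTTP response for paths that robots.txt disallows, and the generated pages would link to more generated pages to form the maze.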

[1]: https://web.archive.org/web/20250322072210/https://docs.hype...

[2]: https://status.sr.ht/issues/2025-03-17-git.sr.ht-llms/


dkh 459 days ago

> The broader context is that most LLM scrapers are a plague, and I cannot wait for this bubble to pop.

My thing is, there are legitimate uses for automated browsing, uses that could be extremely useful and yet nondamaging to (or even supported by!) site operators. But we'll never get to have them if the tools/methods to implement them are the same as the ones used by people inadvertently DDoSing the site they're trying to inhale the entire contents of. For them to not get lumped together, purveyors of the tools cannot remain "neutral" or hide implementation details or condone, whether explicitly or implicitly, bad behavior. We've seen this happen on the web before, and we're already seeing desperate organizations implement nuclear-option LLM scraper blockers that also take out things like RSS readers. Anyways... I may just have to write something about this...

> So, amusingly, they seem to have added an "ethical scraping" page[1] to their docs in between me looking at this a few hours ago and now.

Congrats, you made an impact :)
