| Not only do they not respect robots.txt, but they publish an entire page[1] in their docs dedicated to circumventing scraping countermeasures. I pointed their scraper at a URL on my server to test its behavior. It made four separate requests to the same page, three with the UA "undici"[2], and one with the UA "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36" and a different IP. The first three requests were all made within the span of 1 second, and the fourth 27 seconds later. I emailed their published support address asking for an IP range and UA. They gave me the entire IP range of Google Cloud, and ignored the UA question. This goes well beyond the "it's up to our users to implement responsible scraping practices" implication from the developer's other comment[3]. Instead, their service behaves maliciously by default, and they have implemented and documented switches that users can toggle for additional malicious scraping behavior. As far as I can tell, it is not even possible to implement a robots.txt-respecting scraper on top of this, because I couldn't find any mechanism for users to set a specific UA string. [1]: https://docs.hyperbrowser.ai/sessions/advanced-privacy-and-a... (archived: https://web.archive.org/web/20250322045952/https://docs.hype..., http://archive.today/2025.03.22-050029/https://docs.hyperbro...) [2]: https://github.com/nodejs/undici [3]: https://news.ycombinator.com/item?id=43442116 |
I feel like there's a lot to unpack here, and still much to discuss in the broader context.
There are a few things that can be excused or at least reasonably argued. Multiple IPs make some sense since requests are being made through random proxies, and I don't think that on its own demonstrates bad intent. If it did, I think you would've seen all four requests coming from different IPs and with different user agents, all posing as legitimate user browser sessions.
But the rest is inexcusable. Providing documentation on how to circumvent countermeasures without even acknowledging, or presenting a viewpoint on, the concern many have about the potential malicious use of these tools. Taking the time to respond to your inquiry about the IP range and UA but giving an answer that is somewhere between intentionally incomplete and intentionally misleading. (Is the actual IP range even within the large Google IP range, given the source of the proxies as mentioned elsewhere in this thread?) Just very poor decisions on how to handle the extremely predictable resistance they were very obviously going to encounter.