


	by andrew_zhong 128 days ago
	Good point. The anti-bot patches here (via Patchright) are about preventing the browser from being detected as automated — things like CDP leak fixes so Cloudflare doesn't block you mid-session. It's not about bypassing access restrictions. Our main use case is retail price monitoring — comparing publicly listed product prices across e-commerce sites, which is pretty standard in the industry. But fair point, we should make that clearer in the README.

3 comments

plastic041 128 days ago

robots.txt is the most basic access restriction, and it doesn't even read it, while passing itself off as human[0]. It is about bypassing access restrictions.

[0]: https://github.com/lightfeed/extractor/blob/d11060269e65459e...


zendist 128 days ago

Regardless, you should still respect robots.txt.


andrew_zhong 128 days ago

We do respect robots.txt in production. Scraping browser providers like BrightData also enforce that.

I will add a PR to enforce robots.txt before the actual scraping.
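For illustration, a robots.txt check that runs before any fetch could look roughly like the sketch below. It uses Python's standard `urllib.robotparser` rather than the project's own code, and the `USER_AGENT` name, the `is_allowed` helper, and the sample rules are all assumptions made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical crawler name, used only for this illustration.
USER_AGENT = "ExampleExtractorBot"


def is_allowed(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the robots.txt content as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules standing in for a site's real robots.txt.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for url in ("https://example.com/products/1", "https://example.com/private/x"):
        verdict = "fetch" if is_allowed(SAMPLE_ROBOTS, url) else "skip"
        print(verdict, url)
```

In a real crawler the robots.txt text would be downloaded from the target host first, and the scrape would only proceed when the check passes.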


plastic041 128 days ago

How can people believe that you are respecting bot detection in production when your software's README says it can "Avoid detection with built-in anti-bot patches"?


andrew_zhong 128 days ago

I hear you loud and clear - will replace the stealth browser with plain Playwright and remove anti-bot as a feature.


messe 128 days ago

> It's not about bypassing access restrictions.

Yes. It is. You've just made an arbitrary choice not to define it as such.


andrew_zhong 128 days ago

I will add a PR to enforce robots.txt before the actual scraping.


messe 128 days ago

Or just follow web standards and define and publish your User-Agent header, so that people can block that as needed.
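As a rough sketch of what that could look like, a crawler can attach a fixed, documented User-Agent to every request. The bot name, the info URL, and the `build_request` helper below are hypothetical, not anything the project actually ships.

```python
from urllib.request import Request

# Hypothetical bot name and info page; a real crawler would publish its own.
USER_AGENT = "ExampleExtractorBot/1.0 (+https://example.com/bot-info)"


def build_request(url: str) -> Request:
    """Build a request that identifies the crawler instead of imitating a browser."""
    return Request(url, headers={"User-Agent": USER_AGENT})


if __name__ == "__main__":
    req = build_request("https://example.com/")
    # urllib stores header names capitalized, hence "User-agent".
    print(req.get_header("User-agent"))
```

Site owners can then match that string in robots.txt or in server rules.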

You're creating the wrong kind of value. I really hope your company fails, as its success implies a failure of the web in general.

I wish you the best success outside of your current endeavour.
