Hacker News
by bdcravens 1224 days ago
This bit of marketing copy seems contradictory:

"With MrScraper, you won't be blocked.

We use real browser instances to perform fast but human web scrapings, resulting in a much lower block ratio."

"won't be blocked" implies a zero block ratio. (I do a lot of work with Puppeteer and Playwright, and some larger websites have pretty advanced heuristics for catching automation, so true zero really isn't a defensible claim.)

7 comments

It's obviously an exaggeration, but I think the point is to suggest that you'll have much higher success (as opposed to being blocked) with this service vs rolling your own.

Anyway, if you want to be technical about it, the marketing is correct. YOU won't be blocked. The agent running on your behalf might be blocked, however...

But from a marketing perspective, this "you won't be blocked" falls into the acceptable simplification category. Maybe they could add a * footnote, giving some more detail elsewhere. But at this point in the landing page, it wouldn't make sense to try to state it more accurately as that would require too many words.

There’s a difference between acceptable simplification and misleading, and while the line is not stark, landing on the wrong side of it won’t build as much trust over time.

How about “you’ll be blocked less” or some variation of that form?

Still simple, less risk of disappointment/trust issues.

Oftentimes being "blocked" is more nuanced than whether the site returns a 200 vs a 4xx. The site may render, but the backend API may respond differently based on the behavior it sees.
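
To make that concrete, here is a rough Python sketch of judging a response by more than its status code. Everything in it is an assumption for illustration: the `Response` type, the `CHALLENGE_MARKERS` strings, and the idea of checking for an `expected_marker` that a healthy page should contain. Real sites vary a lot, so treat it as a starting point rather than a reliable detector.

```python
from dataclasses import dataclass


@dataclass
class Response:
    """Hypothetical stand-in for whatever your HTTP client returns."""
    status: int
    body: str


# Assumed phrases that tend to show up on challenge pages; tune per site.
CHALLENGE_MARKERS = ("captcha", "verify you are human", "unusual traffic")


def classify(resp: Response, expected_marker: str) -> str:
    """Label a response as ok, hard_block, challenge, degraded, or error."""
    # Explicit refusals: forbidden or rate limited.
    if resp.status in (403, 429):
        return "hard_block"
    text = resp.body.lower()
    # The page may load with any status but show a challenge instead of content.
    if any(marker in text for marker in CHALLENGE_MARKERS):
        return "challenge"
    # A 200 that lacks the data we came for: the backend answered differently.
    if resp.status == 200 and expected_marker.lower() not in text:
        return "degraded"
    if 200 <= resp.status < 300:
        return "ok"
    return "error"
```

For example, `classify(Response(200, '{"items": []}'), '"price"')` comes back as `"degraded"`: the request succeeded on paper, but the payload is missing what a normal visitor would have received.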
Removing “you won’t be blocked.” should be sufficient then.

It looks interesting. I tried Puppeteer and Playwright but never got the hang of them, so I might be a client for one of these scraper services one day. The first time I tried it I got blocked immediately (probably because it sent no user agent; the machine was a Raspberry Pi).

The best results always come when you run the browser in full GUI mode, rather than headless.
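
For anyone who has not tried it, a minimal sketch of headed mode with Playwright for Python looks roughly like this. It assumes Playwright and its Chromium build are installed and that the machine has a display (or something like Xvfb); `fetch_title` is just a name made up for the example.

```python
def fetch_title(url: str, headed: bool = True) -> str:
    """Open `url` in Chromium and return the page title."""
    # Imported lazily so this file can be loaded without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # headless=False opens a visible browser window, which needs a display.
        browser = p.chromium.launch(headless=not headed)
        try:
            page = browser.new_page()
            page.goto(url)
            return page.title()
        finally:
            # Always close the browser, even if navigation fails.
            browser.close()
```

Calling `fetch_title("https://example.com")` runs headed by default; pass `headed=False` to compare how the same site treats a headless run.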
Thanks for sharing your point of view!

I will rewrite the copy to make better statements. Thank you so much

"It won't be blocked" = they imported the stealth plugin most likely
The stealth plugin is good, but not 100%. Some sites rely on heuristics other than what the browser reports.
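
One example of such a heuristic, sketched in Python: request timing. This is purely an illustration, not what any particular anti-bot vendor does. The function name `looks_automated` and the thresholds `min_requests` and `max_cv` are assumptions picked for the example.

```python
from statistics import mean, pstdev


def looks_automated(timestamps, min_requests=5, max_cv=0.1):
    """Flag a client whose requests arrive at suspiciously regular intervals.

    `timestamps` are request times in seconds, oldest first.
    """
    if len(timestamps) < min_requests:
        return False  # Not enough data to judge.
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    average_gap = mean(gaps)
    if average_gap <= 0:
        return True  # Simultaneous or out-of-order requests.
    # Coefficient of variation: low means metronome-like pacing.
    return pstdev(gaps) / average_gap < max_cv
```

A client hitting a page exactly once per second gets flagged no matter what its browser fingerprint says, while a human's irregular clicks do not. None of this is visible to a plugin that only patches what the browser reports about itself.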
What is a stealth plugin?
Additions to libraries like Puppeteer that help the automated browser look more "organic", often by returning the kind of data a normal browser would have (browsers expose APIs that report things like installed plugins and fonts).
Cool thanks for explaining!
> so true zero really isn't a defensible claim

I feel like this is saying your systems have perfect security, which itself is not a defensible claim

also considering tests above - "webDriver": "FAIL" - seems like you'll totally get blocked by any anti-bot
My actual browser that I use as a human failed that test so it's probably more on them than anything.

Or I might have some kind of add-in/setting configured from hacking around on something over the years.

One would hope that anti-blocking measures are implemented ethically and the documentation clarified to reflect that.
> anti-blocking measures are implemented ethically

Your assumption that blocking is somehow ethical by default is not unproblematic.

There's a world wide web built by academics for free exchange of information and there's a closed garden web built by major capitalists.

Just how free that exchange of information should be is not a settled problem. Some staunch libertarians argue along the lines of information "wanting to be free". Some commercial entities seem to identify copyright and trademark law with moral doctrines. There are plenty of arguments for in-between positions as well.

If we look at less democratic societies, the efforts made to circumvent state censorship are publicly lauded as morally good actions by the international community. Could an analogy be drawn to large corporations censoring the less fortunate in economically uneven societies too, for instance?

That is a good reply generally, but this

> Your assumption that blocking is somehow ethical by default is not unproblematic.

is itself an assumption.

The problem I'm concerned with is aggressive (either deliberately or ignorantly) crawling/scraping of non-commercial sites which often lack the financial resources to defend against activities enabled without apparent concern by tools like the site here.

If a site allows reasonable access in good faith, then subverting those limits and constraints for self-serving reasons is ethically dubious at best, and any service not addressing that while promising to enable that subversion should be questioned.