by BlewisJS 1591 days ago
Unrelated to the article - is it just me or is this scrapingbee product borderline nefarious? From the homepage:

> Thanks to our large proxy pool, you can bypass rate limiting website, lower the chance to get blocked and hide your bots!

> Scrapingbee helps us to retrieve information from sites that use very sophisticated mechanism to block unwanted traffic, we were struggling with those sites for some time now and I'm very glad that we found ScrapingBee.

4 comments

It really depends. There are plenty of legitimate uses for scraping (for example, I've been involved with academic research that involved scraping Twitter search results), and it's only really feasible to collect the amount of data you need using scraping plus paid proxies. That being said, there are also a number of nefarious paid proxy services which offer residential IPs (read: are usually botnets).
What is legitimate to a user is not the same as what is legitimate to a site owner. The legitimate way would probably be to use the Twitter API.
The Twitter API has very low rate limits (from a data collection perspective). While there may be good reasons for that, these limits also preclude doing public interest research of the type we were doing (how Twitter's various search filters influence the political leanings of search results). When companies have Twitter's level of societal influence, I think it's also possible to define "legitimate use" in terms of public interest, rather than simply "users" or "site owners."
No more nefarious than the measures websites put up to avoid scrapers? This just rehashes the hiQ v. LinkedIn case: https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn

(not a user, but I do some amount of scraping through other means)

It is definitely super annoying that companies are allowed to spy on us and do all kinds of crazy things with our data, all using computers and automation and "bots" and such, but individuals are increasingly not allowed to use automation to help us out online. Seems rather one-sided. On the other hand, I get that abuse is a huge problem. I do wish at least bots operating at roughly human request rates & daily total requests were considered OK and universally allowed without risk of blocks or other difficulties leading to increased maintenance costs (so, making them less valuable).
Sometimes the scraping situation gets kinda ironic. I worked at a large eRetailer/marketplace and obviously we scraped our major competitors just as they scraped us (there are four major marketplaces here). So each company had a team to implement anti-scraping measures and defeat competitors' defences. Instead of providing an API, everyone decided to spend time and money on this useless arms race.
Absent someone breaking really far away from the pack, that's a classic example of one type of "bullshit job" called out in Graeber's book... Bullshit Jobs. Zero-sum, ever-escalating competition. Militaries are another obvious example (we'd all be better off if every country's military spending were far closer to zero—but no one country can risk lowering it unilaterally, and may even be inclined to increase theirs in response to neighbors, which sometimes gets so insanely wasteful that you see something like the London Naval Treaty or SALT come about in response) but so is a great deal of advertising and marketing activity (you have to spend more only because your competitor started spending more—end result, status quo maintained, but more money spent all around)
I wonder how anyone in IT could take Graeber seriously. One of his opinions about programming was that programmers work "bullshit jobs" for their employer and do cool open source stuff in their free time which is demonstrably false.
The presentation of that in the book, based off a message from someone in the industry, doesn't seem out of line with the overall tone and reliability-level that Graeber explicitly sets out in the beginning, which is both that the book is not rigorous science and that it's mainly concerned with considering why people's perceptions of their own jobs would be that they're bullshit.

[EDIT]

> One of his opinions about programming was that programmers work "bullshit jobs" for their employer and do cool open source stuff in their free time which is demonstrably false.

Further, I'm not even sure that's incorrect. It can both be true that most open source (that's actually used by anyone) is done by people who are paid to do it, and that most programmers have very little interesting or challenging to do at work unless they work on hobby projects—maybe open source—in their free time.

The overall letter as quoted in the book, and Graeber's commentary on it, actually makes some good points aside from all this. Things don't have to be perfect to be useful.

A company my previous employer partnered with once asked us to integrate with them.. via scraping and using bots to fill out forms.

Which would have been fine except they also imposed terribly low rate limits with no ability to check them.

We eventually pulled the partnership since it was more work than value.

A lot of data I provide to services is exposed to other individuals so that the service can function. That doesn't mean that data belongs to those people or that they can freely use that data elsewhere.

Allowing unfettered scraping and repurposing of data would have a chilling effect on all types of services. For example I wouldn't necessarily want a bot to scrape my comment history on HN, doxx me, and share my identity and comments with others.

I believe whenever the “no automation/scraping/bots” clause in Ts&Cs has been tested in court, it has never held up. However, that’s not to say a service can’t just cancel your account if you are found to be using one.

I've run a site that had a bot get stuck in a loop and suddenly send 10,000 times its usual request rate. When they go wrong, it’s super annoying for the website owner. We ultimately just banned all of the AWS IP ranges.

"Nefarious" is a strong word. Courts have repeatedly ruled that scraping data that is otherwise available publicly is legal. You may not personally agree with the ethics, but there are a lot of people who do.
I agree it's a strong word, which is why I said borderline nefarious. However, it's not that far off from a DDOS tool.

At least in the United States, sounds like the jury is still out on the legality: https://www.reuters.com/technology/us-supreme-court-revives-..., but my perspective was more from an ethics standpoint anyway.

It is very far from a DDOS tool. Scraping can be done from a single source, one request at a time, with self imposed rate limits. Sure it can overwhelm a server, but then so can a single user opening 10 tabs.
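
To make "one request at a time, with self imposed rate limits" concrete, here is a minimal Python sketch. The `PoliteFetcher` name, the `min_interval` parameter, and the injectable `fetch`, `clock`, and `sleep` callables are all assumptions made for illustration, not anything from the product being discussed.

```python
import time


class PoliteFetcher:
    """Fetch URLs sequentially, leaving at least `min_interval` seconds between requests."""

    def __init__(self, fetch, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        # `fetch` is any callable taking a URL and returning a response
        # (hypothetical; plug in whatever HTTP client you use).
        self.fetch = fetch
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self._last_request = None  # time of the previous request, if any

    def get(self, url):
        # Wait out the remainder of the interval since the previous request.
        if self._last_request is not None:
            elapsed = self.clock() - self._last_request
            if elapsed < self.min_interval:
                self.sleep(self.min_interval - elapsed)
        self._last_request = self.clock()
        return self.fetch(url)

    def get_all(self, urls):
        # Strictly one request at a time, in order.
        return [self.get(url) for url in urls]
```

With `min_interval=1.0` this caps the scraper at roughly one request per second from a single source, which is the scenario described above rather than a distributed proxy pool.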
> Scraping can be done from a single source

That's not what this tool does though. It allows you to distribute your scraping to a layer of proxies. So, the only difference is whether there is an intent to do harm to the target or merely collect data... which could be a form of doing harm as well?

There are plenty of tools like this where going up to the line is much different than crossing it. There's a vast difference between driving your car to an event and driving the few extra meters into the crowd at an event. You can cut down a tree with a chainsaw or cut down a tree onto your neighbour's house.

There's definitely an argument that dangerous tools should be regulated to varying degrees. If we're arguing regulations in this specific area you'd probably also be balancing it with regulations that sites can't close an account for reasonable rate automated access and that public research can have higher rates so long as they're not crippling.

The tree example is true and why I agree these things are very similar. The only significant difference is when you put it on your neighbor’s house on purpose.

I wouldn’t regulate this, but if you’re introducing regulations, why not just require the source to deliver the data in a neatly packaged format? The necessity for scraping, and with it the potential for DDOS and other nefarious behavior, basically goes away.

Based on another comment, and the Wikipedia article they linked to, it looks like the Supreme Court vacated the decision and remanded the case for further review in June 2021 (probably after this article).[1] Unfortunately there is no citation for that sentence so I'm not entirely sure.

I think that means the jury is still out, as you mentioned, but it's leaning towards scraping being legal as long as the data is publicly available. IANAL

[1] https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn

Nefarious? Then they should arrest Google first, it is the king of web scrapers.
Robots.txt
If the Google crawler actually respected robots.txt your point might be salient.
It does.

Please verify your experience against the Google IP range.

https://developers.google.com/search/docs/advanced/crawling/...

A lot of crawlers spoof the Googlebot user agent so you wouldn't block them ;)
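
One common way to tell a real Googlebot from a spoofed user agent is a reverse DNS lookup followed by a forward lookup that must resolve back to the same IP. Below is a minimal Python sketch of that check. The function name, the injectable lookup callables, and the exact suffix list are assumptions made for illustration, so check Google's crawler documentation for the current hostnames. This version handles IPv4 only.

```python
import socket

# Assumed hostname suffixes for Google's crawlers; verify against Google's docs.
GOOGLE_CRAWLER_SUFFIXES = (".googlebot.com", ".google.com")


def _reverse_dns(ip):
    # gethostbyaddr returns (hostname, aliaslist, ipaddrlist)
    return socket.gethostbyaddr(ip)[0]


def _forward_dns(hostname):
    # gethostbyname_ex returns (hostname, aliaslist, ipaddrlist); IPv4 only
    return socket.gethostbyname_ex(hostname)[2]


def is_verified_googlebot(ip, reverse_lookup=_reverse_dns, forward_lookup=_forward_dns):
    """Return True only if `ip` reverse-resolves to a Google crawler host
    and that host resolves forward to the same IP."""
    try:
        hostname = reverse_lookup(ip)
    except OSError:  # socket.herror / socket.gaierror are OSError subclasses
        return False
    if not hostname.lower().rstrip(".").endswith(GOOGLE_CRAWLER_SUFFIXES):
        return False
    try:
        return ip in forward_lookup(hostname)
    except OSError:
        return False
```

The forward lookup matters because anyone who controls the reverse DNS for their own IP block can make it claim a googlebot.com hostname. Only the forward record is under Google's control.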

Surely you must be joking. Alphabet is the largest web scraper in the world. They would soon go out of business if robots.txt was the only data they scraped.

It’s not a web crawler. They are all web scrapers. And Alphabet/Google sells this data and makes profits from it.

It is not like it is trying to hide the fact that it is king web scraper.

Google has gotten in trouble from various publishers for this before. It is no secret there is a double standard in big tech.

Again if you are going to arrest a web scraper, then arrest the king of all web scrapers first to make it fair.

Data wants to be free. If it is publicly accessible then it is fair game.

I'm probably not going to get a reply, but let's try:

Source?

It does.
Think how ridiculous it sounds that Google only has URLs listed in robots.txt. They would've gone out of business long ago.
Do you know how robots.txt works?

It's an exclusion standard, not an inclusion one.

https://en.m.wikipedia.org/wiki/Robots_exclusion_standard

To help with individual URL discovery, you can use sitemap.xml.

In case you already know how it works (and I suppose you do, considering your account age), your comment is just weird tbh.
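
To illustrate the exclusion-versus-inclusion point, here is a small Python sketch using the standard library's `urllib.robotparser`. The robots.txt content, the `example.com` URLs, and the `allowed_urls` helper are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: it lists what crawlers must NOT fetch.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_urls(robots_txt, user_agent, urls):
    """Return the subset of `urls` that `user_agent` may fetch under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


urls = [
    "https://example.com/",
    "https://example.com/private/account",
    "https://example.com/blog/post",
]
print(allowed_urls(ROBOTS_TXT, "ExampleBot", urls))
# → ['https://example.com/', 'https://example.com/blog/post']
```

Neither `/` nor `/blog/post` appears anywhere in the file, yet both are crawlable: anything not excluded is allowed by default.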