Hacker News
by blantonl 1614 days ago
I fail to understand why Web Scraping isn't almost universally viewed as unethical and a terrible and nasty business practice.

In almost all cases I view Web scraping as people who are trying to build businesses on top of other people's innovation and data. I know this isn't a popular opinion, so change my mind, but at the same time, I'm one of those business owners that fights with Web scraping constantly, and my opinion is that those doing it to my platforms are doing so solely to steal data and build businesses on top of others' hard work.

13 comments

I think it really depends on the application of web scraping. (As someone who does, what is in my mind, ethical web scraping)

- Scraping public information from government websites to do analysis: ethical, it's the public's data

- Scraping to help some company's customers more effectively use that company's product, for example scraping a medical office's insurance claims to help them automate their insurance remittance process: ethical

- Scraping faces to build a surveillance-tech company: disgusting

- Scraping your own website because your internal processes are so broken you can't get it any other way: ethical

- Scraping to just copy someone's data they worked hard to generate to go and resell: unethical

The first one here is important. Despite the open data movement pressuring governments to provide their data in easily consumable forms, a lot of government organizations are still unable or unwilling to do so.

Political advocacy orgs rely a lot on scraping to collect political representative data that isn't available through any other means.

Yes, and so do research orgs. My organization does a lot of scraping because we deal with local election data and that's. Uh. Let's just say that if all counties had websites that were like Web 1.0, that would be an improvement over the current situation.
- Scraping faces to find missing persons: ethical

- Scraping photos to create deep learning VQGAN+CLIP art generator: ethical

.. we can go on and on, but we should all agree scraping is a useful tool that should never be outlawed.

I don't think that it is outlawed though, at least in practical terms; no one is gonna sue you for scraping government websites. You really only think about the legal aspect when you do it for commercial gain.

It would be interesting to know if that data can be used in a court case against a government agency though.

> - Scraping to just copy someone's data they worked hard to generate to go and resell: unethical

Wanted to include a slightly different application:

- Scraping multiple websites and organizing data in a new and useful way for customers: To me this would be ethical since it produces new value and does not just copy someone else's data as-is

So it's really not about the "scraping" here, it's about the kind of business you're building. I don't think any of your definitions change if you simply employed people to check the websites instead of scripts.

Re government websites: they're often terrible. I've occasionally contemplated a side project just to scrape and restructure some local/state websites into usable forms with search and whatnot.

And if you manually copy someone's data they worked hard to generate to go and resell, then it's ethical?

Google is web scrapper number one, as is any search engine. Making web scrapping illegal means making search engines illegal.

You do not want information to be public and/or free? Put it under login and charge for it.

If you want to prevent people from reusing the data you publish to build other (potentially competitive) products, then use licensing, copyright, and the law.

However, banning a technological means because of what a minority could potentially do with it? Then make the internet illegal and the problem is fixed altogether.

Google does do some things that aren't great for website owners too. Like "rich snippets", where they present the information from your page right to the end user, leaving that end user with no reason to visit your site.

And, I imagine, lots of A/B testing geared toward exactly that...keeping them on Google-owned properties.

Maybe if all the useful content on your site can fit into a snippet I don't want to visit it?
Maybe the useful content is something you don't know is there, so you settle for what's in the snippet. Because you imagine Google's AI surely extracted the right bits.

There's also a sort of diminishing returns effect here. If Google trains people that the snippet is good enough, less traffic goes to the site. Eventually, enough to shutter the site, for some sites. Then nobody has the info.

The pattern has already affected Google referral traffic to Wikipedia. Pageviews for Wikipedia are roughly flat from 2012 to today, where they had marked growth prior. 2012 is when Google started rolling out their knowledge graph that presented Wikipedia data directly.

Yes, it would be preferable if people were more curious and willing to explore topics in depth. But sometimes all you want to know is what's the capital of Moldavia. Ideally the web would be about easy access to relevant information, not a competition for harvesting page views.
Ok. FWIW, I'm not talking about simplistic facts. Rich snippets are often multiple paragraphs. And I understand the distaste for harvesting page views, but websites are hard to maintain without visitors too.
That always struck me as unethical as well.
What if Google didn't scrape websites automatically, and waited till users submit their domains to them, to mark that they want to be scraped? I think in that case, most users would still submit their domains there, because they want to come up in Google search. You might want your website to be scraped by some people/companies and not by others, but not have to put everything behind a login screen (which some determined scrapers would still try to breach in some way).
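For what it's worth, the existing robots.txt convention already lets a site say "this crawler yes, everyone else no" without a login wall, though it is only honored voluntarily by well-behaved crawlers. A minimal sketch using Python's standard `urllib.robotparser`; the bot names, the example.com URL, and the `allowed` helper are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is welcome, all others are not.
ROBOTS_TXT = """\
User-agent: FriendlyBot
Allow: /

User-agent: *
Disallow: /
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed("FriendlyBot", "https://example.com/page"))
print(allowed("OtherBot", "https://example.com/page"))
```

This only expresses the site owner's wishes; a determined scraper can simply ignore the file, which is the same problem raised above about login screens.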
NB: It’s “scraping”, not “scrapping”.
Google is a crawler, not a scraper; these are two totally different things.
A crawl requires "extraction" of data from a web page, which according to Wikipedia is part of the definition of so-called "web scraping". Even if a crawler is using a sitemap.xml file, it still has to "scrape" (retrieve and extract from) that file first. It seems crawling always requires scraping.

If all the pages to be retrieved are known a priori, before retrieval begins, then one would likely call that "scraping". Whereas if not all pages are known before retrieval begins, then one would likely call that "crawling".
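To make that distinction concrete, here is a rough sketch, not anyone's production code. It assumes a tiny made-up in-memory "site" (`PAGES`) so it runs without network access; `scrape` is handed the full list of pages up front, while `crawl` only knows a starting page and discovers the rest by following links. Both rely on the same extraction step, which is the point made above.

```python
from html.parser import HTMLParser

# Hypothetical in-memory "site" so the sketch runs without network access.
PAGES = {
    "/": '<title>Home</title><a href="/a">A</a><a href="/b">B</a>',
    "/a": '<title>Page A</title><a href="/b">B</a>',
    "/b": '<title>Page B</title><a href="/">Home</a>',
}


class PageParser(HTMLParser):
    """Extracts the <title> text and all link targets from one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract(path):
    """Retrieve one page and extract (title, links) from it."""
    parser = PageParser()
    parser.feed(PAGES[path])
    parser.close()
    return parser.title, parser.links


def scrape(paths):
    """Scraping: the list of pages is known before retrieval begins."""
    return {path: extract(path)[0] for path in paths}


def crawl(start):
    """Crawling: pages are discovered by following links during retrieval."""
    seen, queue, titles = {start}, [start], {}
    while queue:
        path = queue.pop(0)
        title, links = extract(path)
        titles[path] = title
        for link in links:
            if link in PAGES and link not in seen:
                seen.add(link)
                queue.append(link)
    return titles


print(scrape(["/a", "/b"]))
print(crawl("/"))
```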

> I fail to understand why Web Scraping isn't almost universally viewed as unethical and a terrible and nasty business practice.

Scraping is simply a way to get data. I used to run a team that was paid by large government contractors in the US to scrape their job posts from their career portals, and then deliver those posts via email, fax and snail mail to veterans' service officers near the job opening. It was required by regulation, and the only way to get the job data was to scrape. Many enterprise applicant tracking systems did not have a good way to automatically deliver that data or wanted $millions for that capability. Scraping was the best way and in some cases, the only way.

By the way, search engines like Google scrape data and index it.

Some web scraping can be unethical, say for example if you are scraping a site solely to mirror their content and add zero value to the original content owner.

However, there are a lot of web scraping use cases which are beneficial to the site being scraped and actually add value. Two examples:

- Google: Ahrefs & SEMRush scrape Google so they can provide SEO analytics to companies looking to grow. Google's keyword analytics aren't great, so Google has effectively outsourced providing a good analytics tool to Ahrefs & SEMRush, whose products increase the value of the Google SERPs ecosystem.

- Amazon + Other E-Commerce: Amazon wants brands and 3rd party stores to list products on their site, and the companies scraping Amazon to provide product placement tools to their users make it easier and more profitable to list products on Amazon, leading to more and more companies listing products there.

> Some web scraping can be unethical, say for example if you are scraping a site solely to mirror their content and add zero value to the original content owner.

Archiving is unethical?

Good point, wouldn't say archiving is unethical at all...I was thinking more along the lines of someone scraping an entire segment of a website's data and reproducing it 1 for 1 on their own site with zero value add.

I think we can't make broad statements saying that web scraping is ethical or unethical, it isn't that black or white. It really depends on what is being scraped, how it is being used, and the intention of the scraper.

Do you provide an API, paid or not, for the same data? An API which might even have limitations on use makes scraping a bit less defensible in my mind, but if you're offering something for free to the public and then getting upset when people take and use that free info, maybe free isn't the right business model, or maybe you should look into what those people are using that scraped data for and see if you can offer it better and cheaper.

The best way to stop someone trying to make a buck on your hard work is to go direct to their customers and do a better job. If you can't, what they're selling is something on top of your offering and you aren't serving that market, and you either should start serving it, or make a deal so the scrapers can continue to do it without impacting your service.

As someone that had to do scraping in the past, and went through having a free open API that served our needs perfectly replaced with an account-based one that required we make 100x the queries, it was really frustrating that the company refused to even respond to queries for specific business accommodations to data.

Here are two use cases why I scrape YouTube.

- There is no external API for getting scheduled streams or when they have gone live AFAIK. This lets me be notified of new stuff to watch.

- The API for getting a channel's members is locked down. I applied for access to it 6 months ago and haven't heard anything about it from YouTube so I just scrape it to give members perks.

Madness that they haven't gotten back to your access request in 6 months!

Why even bother having the API there - so much value can be added by people building on top of YouTube and other large sites, it's a shame that most of these large sites do nothing to provide API access and people have to go out of their way to scrape them...

There are pro-social and anti-social uses of web scraping. If you have ever used Kayak or any other price discovery or price comparison website, you've relied on web scraping to provide you a service.
Also google or any other search engine
I believe Kayak has agreements with the sites they scrape though. So it's a different type of "scraping", really.
When I want to do web scraping, it's because I have an idea to build on top of the content of the website I would like to scrape.

Let's say you made a recipes website and I would like to build an app that will order the ingredients for a meal.

It would be useful to extract the recipes, so that I can create experiences like users picking a meal and have the ingredients delivered.
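As a rough sketch of what that extraction could look like, assuming a made-up recipe page that marks each ingredient with `class="ingredient"` (real sites will differ, and the markup, class name, and `extract_ingredients` helper here are purely hypothetical), Python's standard `html.parser` is enough:

```python
from html.parser import HTMLParser

# Hypothetical recipe markup; real sites will use different structure.
RECIPE_HTML = """
<h1>Tomato Soup</h1>
<ul>
  <li class="ingredient">4 tomatoes</li>
  <li class="ingredient">1 onion</li>
  <li class="note">Serves two</li>
</ul>
"""


class IngredientParser(HTMLParser):
    """Collects the text of every <li class="ingredient"> element."""

    def __init__(self):
        super().__init__()
        self.ingredients = []
        self._capturing = False

    def handle_starttag(self, tag, attrs):
        if tag == "li" and dict(attrs).get("class") == "ingredient":
            self._capturing = True
            self.ingredients.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self._capturing = False

    def handle_data(self, data):
        if self._capturing:
            self.ingredients[-1] += data


def extract_ingredients(html):
    """Return the ingredient strings found in html, whitespace-trimmed."""
    parser = IngredientParser()
    parser.feed(html)
    parser.close()
    return [item.strip() for item in parser.ingredients]


print(extract_ingredients(RECIPE_HTML))
```

The app would then only need the ingredient list for ordering, and could link back to the original page for the recipe itself.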

I guess I can't show your recipes, as that could be copyright infringement, but I can link to you and sell the tomatoes.

Also, although copying someone's work is unethical and likely illegal, there is nothing unethical or illegal about using computers to analyse the data out there. I should be able to analyse recipe publications just as I can measure the air pollution. Web scraping comes in because the semantic web never happened.

I think we should all be able to use other people's work to build something else on top of it. Of course, I do not advocate outright taking it and reselling it as our own.

For example, I would like to be able to create an app with Netflix content but obviously I don't expect to be able to stream their content as if it is mine. What I should be able to do is to create an app with an experience designed by me that lets you stream their movies if you pay them.

Because there would be no Internet search - no search engines, no Google Search, and essentially no Internet bigger than a hobbyist DARPA - without web scraping.
> people who are trying to build businesses on top of other people's innovation and data

How would scraping, say, Reddit, differ from the business model of Reddit itself?

> those that are doing it to my platforms are doing so solely to steal data

What kind of data are you talking about?

Scraping itself isn't universally unethical. Google and Bing scraping websites to make information easier to access is fine, and scraping and analysing government data is even better. Public data should be public, after all.

However, the disgusting data brokers that employ most of the custom scrapers are usually unethical. That's why I don't trust any person or company that admits being involved professionally in "scraping", because most of the time that means "we collect personal information that got leaked elsewhere and sell it on".

If we want to take the unethical route, I’d argue not providing an API (paid or free) is unethical and a nasty business practice.

I work for an ecommerce company and we scrape competitors for price information. Should this automated process using APIs not be okay, we’ll have humans do it. Less efficient for us, more traffic for a competitor. Should they provide a paid API with price information available, I’m sure we’d pay.

I think if you make intangible things public you shouldn’t consider them to be only yours anymore.