Hacker News
by henrik1409 3905 days ago
It's great to see discussions going on here. I would like to tie a few comments to the questions about the ethical aspects of web scraping:

As some have pointed out, scraping is not exactly a new thing, and a lot of the biggest sites out there are built on web scraping or crawling. We provide a tool and expect you to use that tool while abiding by the law; if you do not, we will of course shut your account down immediately. Breaking the law includes violating copyrights and performing DDoS attacks (although these would be rather small attacks, since even 50 concurrent agents is no big deal for most websites).

We consider ourselves good netizens. We wish nothing more than to provide a good, easily accessible and safe tool for extracting valuable information from the internet, be it for a price comparison site in a market that lacks transparency, business intelligence for your company to make informed and wiser decisions, or a PhD project that requires access to millions of data points available online in unstructured form.

Additionally, if you feel we're providing services that have ill intent: we are not providing any services (Captcha and proxy rotation) that anyone with a bit of programming skill could not easily use in their own software. The main difference is that we are actively focusing not only on making a good experience for our users, but also on minimizing the impact on the sites being scraped. This involves several things, like automated throttling and slow-site detection, request caching, and blocking requests to services such as Google Analytics so as not to interfere with site owners' stats.
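
For readers wondering what throttling, slow-site detection, request caching, and analytics blocking could look like in practice, here is a minimal Python sketch. It is an illustration only, not the service's actual implementation: the class name `PoliteFetcher`, the blocklist contents, and every default value (1 second base delay, 2 second slow threshold, doubling back-off) are assumptions made for the example.

```python
import time
from urllib.parse import urlparse

# Hypothetical blocklist; the comment above only names Google Analytics.
BLOCKED_HOSTS = {"www.google-analytics.com", "google-analytics.com"}


class PoliteFetcher:
    """Wrap a fetch callable with blocking, caching, and adaptive throttling."""

    def __init__(self, fetch, base_delay=1.0, slow_threshold=2.0,
                 max_delay=60.0, clock=time.monotonic, sleep=time.sleep):
        self.fetch = fetch                    # callable: url -> response body
        self.base_delay = base_delay          # minimum seconds between hits per host
        self.slow_threshold = slow_threshold  # a slower response marks the site as slow
        self.max_delay = max_delay            # upper bound for the back-off
        self.clock = clock
        self.sleep = sleep
        self.cache = {}    # url -> body (kept forever in this sketch)
        self.delay = {}    # host -> current delay in seconds
        self.next_ok = {}  # host -> earliest time the next request may start

    def get(self, url):
        host = urlparse(url).hostname or ""
        if host in BLOCKED_HOSTS:
            return None                       # never touch analytics endpoints
        if url in self.cache:
            return self.cache[url]            # repeat requests cost the site nothing
        wait = self.next_ok.get(host, 0.0) - self.clock()
        if wait > 0:
            self.sleep(wait)                  # throttle per host
        start = self.clock()
        body = self.fetch(url)
        elapsed = self.clock() - start
        delay = self.delay.get(host, self.base_delay)
        if elapsed > self.slow_threshold:
            delay = min(delay * 2, self.max_delay)   # slow site: back off
        else:
            delay = max(self.base_delay, delay / 2)  # healthy site: recover
        self.delay[host] = delay
        self.next_ok[host] = self.clock() + delay
        self.cache[url] = body
        return body
```

The `fetch`, `clock`, and `sleep` arguments are injected so the policy can be exercised without touching the network; a real crawler would also need cache expiry and error handling, which this sketch leaves out.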

3 comments

If you are a good netizen, could you please provide your user agent so I can block your bot on all sites I operate?

Thank you.

EDIT: found that in your FAQ:

"Since disclosing IP’s and user agents would allow anyone to identify all traffic coming from our system – we naturally never do."

That is the opposite of being a good netizen and I hope I'll be able to sue you once I find out your services are helping to scrape my content.

2nd EDIT: Found out that you reside in Denmark, and therefore in the EU; that makes it way easier.

Saying 50 concurrent agents is no big deal for most websites is kinda flippant. It all adds up, and not every popular website can handle loads of traffic. By using a headless browser to get a fully rendered version of a webpage, you also make the web server serve all the static files, which costs the site bandwidth as well.

The problem is "the law" is murky on web scraping. For example, did you know that even if your users are only extracting non-copyrighted (even non-copyrightable) data from a page, a judge once ruled that the act of storing the entire page in RAM constituted copyright infringement? The reasoning was that the page contained some copyrighted elements (like the company's logo), even though those were disposed of immediately after extraction. This was Ticketmaster v. RMG Technologies, and it was used against Power Ventures in Facebook's case against them.

Contrast with Feist Publications, Inc. v. Rural Telephone Service Co., where it was ruled that it was legal to copy data from a phonebook and republish it, since it was non-copyrightable factual data.

There are several other ridiculous early rulings that were made while the internet was still coming of age, and I think before many judges really understood the way it worked. Recent cases have been bucking these precedents, but you can still get the book thrown at you based on those rulings.

Read about 3Taps and please understand that you will be sued, as they were, unless you fold the moment you get a C&D, which would make your site fairly useless.

Google and all other search engines are illegal in the US in most cases. They just don't get in trouble for most of their activity because people usually want to be on Google. If you end up collecting data in a way that someone doesn't like, things won't go so well for you. See Facebook Inc. v. Power Ventures, Inc. That guy got raked over the coals; I'm sure Facebook was trying to make an example of him.

Data portability is a threat to the business model of many web incumbents, and that means they want scraping, a critical tool for ensuring that portability, to remain in a nebulous grey area; this allows them to use it for their own purposes (which they often do) and also to try to block people who are using data found on their platform in a way they don't like. This basically results in the bigger company getting their way, because only other multi-billion dollar companies really have the resources to fight against the army of $1k/hr lawyers that public companies hire to try to enforce their opinions on upstarts.

What we really need is serious internet law reform that favors a fair and open platform. Unfortunately, whenever we hear about "internet law reform", it's skewed to the interests of the megacorps who want more tools to shut down innovators that may threaten their business models, not toward creating an open and fair environment for innovation.

Consider, for instance, how ridiculous it would be if every time you opened a book one of the title pages contained a "Terms of Reading" that bound you not to use the information in the book, even the non-copyrightable information, in any way that the book's publisher didn't like, required you to only read the book using the publisher's approved reading methods (perhaps only Oakleys and Ray-Bans are publisher-approved eyeglasses, only Herman Miller publisher-approved seating, and only GE bulbs publisher-approved lighting), required you to agree that you'd never sue the publisher in court but always use private arbitrators that the publisher can easily, even implicitly, buy off, and so forth.

Consider the viability of the argument that you committed copyright infringement by looking at the pages of the book when the author didn't want you to, because the reflection of the content on your eyes constituted an illegal copy.

These things would get laughed out of court, but the digital equivalent is frequently upheld when it comes to online activity.

I think eventually things will stabilize and scraping non-copyrighted data will unambiguously not be a crime, but unfortunately, I think it may still be a few more decades until that happens. I really hope your company is able and willing to help us set the right precedents by committing the tens of millions it will take to win each piece of that stability, since you're set up so perfectly to be the target of several scraping-related lawsuits.

Recent rulings, like QVC v. Resultly and Nguyen v. Barnes and Noble Inc., have been much more positive than earlier ones, even if they're not altogether ideal, indicating that some magistrates are starting to think of the internet in sensible terms. The rest has to be done through the legislature. Please help make the web safe for data.

IANAL