by observationist 805 days ago
The 9th Circuit Court of Appeals found, in hiQ Labs v. LinkedIn, that scraping publicly accessible content on the internet is likely not a violation of the Computer Fraud and Abuse Act (CFAA).

If you publish something on a publicly served internet page, you're essentially broadcasting it to the world. You're putting something on a server which specifically communicates the bits and bytes of your media to the person requesting it without question.

You have every right to put whatever sort of barrier you'd like on the server, such as a sign-in, a CAPTCHA, a puzzle, a cryptographic key exchange mechanism, and so on. You could limit the access rights to people named Sam, requiring them to visit a particular real-world address to provide notarized documentation confirming their identity in exchange for a unique 2FA fob and credentials for secure access (call it The Sams Club, maybe?).

If you don't put up a barrier, and you configure the server to deliver the content without restriction, or put your content on a server configured as such, then you are implicitly authorizing access to your content.

Little popups saying "by visiting this site, you agree to blah blah blah" are not valid. Courts made the analogy to a "gate-up/gate-down" mechanism. If you have a gate down, you can dictate the terms of engagement with your server and content. If you don't have a gate down, you're giving your content to whoever requests it.

You have control over the information you put online. You can choose which services and servers you upload to and interact with. Site operators and content producers can't withdraw their intent or consent after the fact: once something is published and served, the only restrictions on the scraper are on how they use the information in turn.

Someone who's archived or scraped publicly served data can do whatever they want with the content within established legal boundaries. They can rewrite all the AP news articles with their own name as author, insert their name as the hero in all fanfic stories they download, and swap out every third word for "bubblegum" if they want. They just can't publish or serve that content, in turn, unless it meets the legal standards for fair use. Other exceptions to copyright apply for educational, archival, performance, and accessibility purposes, and under certain legal doctrines such as First Sale. Personal use of such media is effectively unlimited.

The legality of web scraping is not disputed in the US. Other countries have some silly ideas about post-hoc "well that's not what I meant" legal mumbo jumbo designed to assist politicians and rich people in whitewashing their reputations and pulling information offline using legal threats.

Aside from right to be forgotten inanity, content on the internet falls under the same copyright rules as books, magazines, or movies published on physical media. If Disney set up a stall at San Francisco city hall with copies of the Avengers movies on a thumb drive in a giant box saying "free, take one!", this would be roughly the same as publishing those movie files to a public Disney web page. The gate would be up. (The way they have it set up in real life, with their streaming services and licensed media access, the gate is down.)

So, leaving aside the legality of redistributing content, there's no restriction on web scraping public content, because the content was served intentionally to the software or entity that visited the site. It's up to the server operator to put barriers in place and to make content private. It's not rocket surgery, but platforms want to have their cake and eat it too, claiming a degree of control over publicly accessible content that is neither legal nor practical.

Twitter/X is a good example of impractical control, since the site has effectively become useless spam without signing in. Platforms have to play by the same rules as everyone else. If the gate is up, the content is fair game for scraping. The Supreme Court sent the case back to the 9th Circuit, which reaffirmed its ruling and applied the gate up/gate down test for legality of access to content.

Since Google and other major corporations have a vested interest in the internet remaining open and free, and their search engines and other tech are completely dependent on the gate up/gate down status quo, it's unlikely that the law will change any time soon.

Tl;dr: Anything publicly served is legal to scrape. LinkedIn (owned by Microsoft) tried to stop hiQ Labs from scraping public LinkedIn profiles, but the 9th Circuit ruled in favor of access. If Microsoft's lawyers and money can't impede scraping, it's likely nobody will ever mount an effective challenge, and the gating doctrine is effectively the law of the land.