Hacker News new | ask | show | jobs
by observationist 837 days ago
Relevance and impact aside, if you publish something to the internet on a site with no access restriction in place, I don't know how you can keep a straight face while claiming some sort of moral right to the content. It's the equivalent of broadcasting it over radio, or printing and delivering it straight to the doorsteps of millions of random individuals. Methinks you doth protest too much, or something.

There are ways of copyrighting data, and establishing ownership of intellectual property. Your tumblr fanfic, youtube comments, or HN discussions are not legitimate copyright avenues. Stuff you post to legally scrapeable websites are fair game for fair use.

I can do anything I want in private to any data I collect. I could create an awesome HN LLM on the scraped datasets, and use it privately to my hearts content. I can even set up an API to that LLM that generates content, and, given recent rulings, even if i had all the written copyrighted data in the world, as long as I was making good faith efforts to ensure copyright was being respected and works weren't being recreated verbatim, then I could even use that model commercially. I just couldn't sell it to other people, or distribute it, without entering a different legal regime.

I can collect any data I want from public facing websites.

That's how the internet works; it's how it was designed. There are authentication mechanisms, network configurations, and a myriad other access control schemes you can implement to prevent public access. If you post to sites without those mechanisms, you're tacitly agreeing to give up any plausible claims of protection against a wide array of fair uses well established by precedent cases at this point. If you don't prevent public access, and you've got a domain name on a server, you're tacitly inviting the world to come download whatever it is you have on your server. This is a social good. This is what we want when we participate in the internet.

Insisting on some sort of vague entitlement as to how "your" data gets used completely bypasses the fact that anything you consider to be misused in OpenWebText2 fundamentally stems from the fact that you posted the content to a publicly visible website and gave up any say in what happens thereafter. It was scraped fair and square.

Don't complain that you didn't know the rules, or that life isn't fair.

It's not even clear that terms of service or those little popups on public websites have any legal relevance. If your website is open to the public, then it's fair game. If you post content to a public website, then that content's fair game.

2 comments

It feels like you're picking apart an argument I didn't make. But I would note that most people don't see this so unambiguously as the position you're defending. To give you an analogy: doxxing is "fair game" too if you posted your info online or gave it to others. But it's not exactly cool to do it, right? It's a subversion and abuse of the system we have in place.

Finally, here's a fun experiment: decide that terms of service don't matter and start building a product by scrapping Facebook or Google. See how they'd react. Actually, no need for guesswork - they clutched their pearls and threatened legal action more than once before. It's a bit of a "have your cake and eat it too" kind of a deal. Their data is precious intellectual property; your stuff is, well, up for grabs.

Oh, for sure, they get all pearl clutchy when others try to do exactly what they have done, and they get all "not like that!" about it. The US is a society run by lawyers, and the big corps have the best lawyers. Maybe we can legislate out of the hole at some point, but it's a pretty grim outlook. Google et al also don't have to have the law on their side, they can simply litigate people and businesses into bankruptcy, regardless of the legal merit of their actions.

At any rate - there are ways of staking legitimate claim to content you publish online. Even by doing so, it may not be relevant. Robots.txt is a convention, not a regulation or law. It's respected out of social nicety, not because it's strictly legally required.

If you publish your data to a website where it's publicly visible, you are inviting the world to come download your data. When that data leaves your server and goes to live on the downloader's computer, the downloader can do whatever they want with that data.

It's not clear that it's legally possible to prevent the use of data in training models unless you require someone to sign a contract to that effect before being allowed to download your data.

That would be obnoxious, and I wouldn't bother with your content anymore. Like Instagram, LinkedIn, and Twitter, your site would get a 127.0.0.0 hosts file entry.

The US needs a clear, modern update to copyright law that upholds and maximizes individual rights, as well as privacy and property concerns. We shouldn't be playing this game where we pretend a website is somehow an analogy for a page of text scribed with a quill pen and using laws developed to handle issues when quill and parchment were relevant.

Let's write some new laws where we regulate what things are, and not play tortuous mental gymnastics to contort and butcher existing laws and precedents to say whatever the most expensive lawyers want.

Maybe the social contract allows for people to prevent their conversations from being scraped and used by third parties without explicit consent, even if the conversation is entirely public. I don't like that view, but I see the argument for it.

As things stand, though, fair use and public access make things pretty bright and clear, and rulings in various AI cases so far have favored broad fair use interpretations, and are requiring complainants to show specific, particular harms. If/When those harms are shown, then we'll see if any carveouts will be made, or if broad fair use interpretations will be the baseline for content scraping going forward.

> It's the equivalent of...printing and delivering it straight to the doorsteps of millions of random individuals.

Which, incidentally, the New York Times does and they seem to think they have some legal right to the redistribution of their work.

Maybe they're right, maybe they're wrong, it's up to the courts to decide.