Hacker News
by quatrefoil 826 days ago
While a lot of attention has been given to books3, another large component of this dataset is the deceptively-named "OpenWebText2". What's that? It's a scrape of 15 years' worth of third-party websites that were linked to from upvoted Reddit submissions. I know this includes some of my writing.
2 comments

Relevance and impact aside, if you publish something to the internet on a site with no access restriction in place, I don't know how you can keep a straight face while claiming some sort of moral right to the content. It's the equivalent of broadcasting it over radio, or printing and delivering it straight to the doorsteps of millions of random individuals. Methinks you doth protest too much, or something.

There are ways of copyrighting data and establishing ownership of intellectual property. Your Tumblr fanfic, YouTube comments, or HN discussions are not legitimate copyright avenues. Stuff you post to legally scrapeable websites is fair game for fair use.

I can do anything I want in private to any data I collect. I could create an awesome HN LLM on the scraped datasets and use it privately to my heart's content. I can even set up an API to that LLM that generates content. Given recent rulings, even if I had all the written copyrighted data in the world, as long as I was making good-faith efforts to ensure copyright was being respected and works weren't being recreated verbatim, I could even use that model commercially. I just couldn't sell it to other people, or distribute it, without entering a different legal regime.

I can collect any data I want from public-facing websites.

That's how the internet works; it's how it was designed. There are authentication mechanisms, network configurations, and a myriad other access control schemes you can implement to prevent public access. If you post to sites without those mechanisms, you're tacitly agreeing to give up any plausible claims of protection against a wide array of fair uses well established by precedent cases at this point. If you don't prevent public access, and you've got a domain name on a server, you're tacitly inviting the world to come download whatever it is you have on your server. This is a social good. This is what we want when we participate in the internet.

Insisting on some sort of vague entitlement as to how "your" data gets used completely bypasses the fact that anything you consider to be misused in OpenWebText2 fundamentally stems from the fact that you posted the content to a publicly visible website and gave up any say in what happens thereafter. It was scraped fair and square.

Don't complain that you didn't know the rules, or that life isn't fair.

It's not even clear that terms of service or those little popups on public websites have any legal relevance. If your website is open to the public, then it's fair game. If you post content to a public website, then that content's fair game.

It feels like you're picking apart an argument I didn't make. But I would note that most people don't see this as unambiguously as the position you're defending suggests. To give you an analogy: doxxing is "fair game" too if you posted your info online or gave it to others. But it's not exactly cool to do it, right? It's a subversion and abuse of the system we have in place.

Finally, here's a fun experiment: decide that terms of service don't matter and start building a product by scraping Facebook or Google. See how they react. Actually, no need for guesswork - they have clutched their pearls and threatened legal action more than once before. It's a bit of a "have your cake and eat it too" kind of a deal. Their data is precious intellectual property; your stuff is, well, up for grabs.

Oh, for sure, they get all pearl clutchy when others try to do exactly what they have done, and they get all "not like that!" about it. The US is a society run by lawyers, and the big corps have the best lawyers. Maybe we can legislate out of the hole at some point, but it's a pretty grim outlook. Google et al also don't have to have the law on their side, they can simply litigate people and businesses into bankruptcy, regardless of the legal merit of their actions.

At any rate - there are ways of staking a legitimate claim to content you publish online. Even if you do, it may not matter. Robots.txt is a convention, not a regulation or law. It's respected out of social nicety, not because it's strictly legally required.
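
To make the "convention, not law" point concrete, here is a minimal sketch of how a polite crawler might consult robots.txt, using Python's standard `urllib.robotparser`. The robots.txt contents, the `ExampleTrainingBot` user agent, and the `allowed` helper are all hypothetical, made up for illustration. The check only has an effect if the crawler chooses to run it and honor the answer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents (an assumption for this sketch).
# A real crawler would download this file from the site it is visiting.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: ExampleTrainingBot
Disallow: /
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("SomeBrowser", "https://example.com/blog/post"),
        ("SomeBrowser", "https://example.com/private/notes"),
        ("ExampleTrainingBot", "https://example.com/blog/post"),
    ]:
        print(agent, url, allowed(agent, url))
```

Nothing on the server enforces the result. A scraper that never calls something like `allowed` gets the same bytes as one that does.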

If you publish your data to a website where it's publicly visible, you are inviting the world to come download your data. When that data leaves your server and goes to live on the downloader's computer, the downloader can do whatever they want with that data.

It's not clear that it's legally possible to prevent the use of data in training models unless you require someone to sign a contract to that effect before being allowed to download your data.

That would be obnoxious, and I wouldn't bother with your content anymore. Like Instagram, LinkedIn, and Twitter, your site would get a 127.0.0.1 hosts file entry.

The US needs a clear, modern update to copyright law that upholds and maximizes individual rights, as well as privacy and property concerns. We shouldn't be playing this game where we pretend a website is somehow an analogy for a page of text scribed with a quill pen and using laws developed to handle issues when quill and parchment were relevant.

Let's write some new laws where we regulate what things are, and not play tortuous mental gymnastics to contort and butcher existing laws and precedents to say whatever the most expensive lawyers want.

Maybe the social contract allows for people to prevent their conversations from being scraped and used by third parties without explicit consent, even if the conversation is entirely public. I don't like that view, but I see the argument for it.

As things stand, though, fair use and public access make things pretty bright and clear. Rulings in various AI cases so far have favored broad fair use interpretations and are requiring complainants to show specific, particular harms. If and when those harms are shown, we'll see whether any carveouts will be made, or whether broad fair use interpretations will be the baseline for content scraping going forward.

> It's the equivalent of...printing and delivering it straight to the doorsteps of millions of random individuals.

Which, incidentally, the New York Times does, and they seem to think they have some legal right over the redistribution of their work.

Maybe they're right, maybe they're wrong, it's up to the courts to decide.

Care to give me your domain name so I can check all major LLMs for plagiarism? I have a feeling none of them can produce a sentence from your writings.
It takes deliberate effort, but I was actually able to get pieces of my writing out of one of the leading LLMs (not ChatGPT). This is not particularly unique; a number of folks have demonstrated the same.
How long were those pieces?
I would ask instead how unique the string of text was; you can't lay claim to something where the words naturally follow one another and a Google search turns up several results.
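
The "how long" question, at least, can be measured. A minimal sketch, assuming you have your original text and a model's output as plain strings: find the longest run of consecutive words the two share. The function name `longest_shared_run` and the sample strings are made up for illustration, and this says nothing about how common the phrase is elsewhere on the web, which is the uniqueness question raised above.

```python
from typing import List


def longest_shared_run(source: str, output: str) -> List[str]:
    """Return the longest run of consecutive words found in both texts.

    Comparison is case-insensitive and splits on whitespace only, so
    punctuation attached to a word counts as part of that word.
    """
    a = source.lower().split()
    b = output.lower().split()
    best_len = 0  # length of the longest shared run seen so far
    best_end = 0  # index in `a` just past the end of that run
    # prev[j] holds the length of the shared run ending at a[i-2], b[j-1].
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


if __name__ == "__main__":
    mine = "the quick brown fox jumps over the lazy dog"
    model = "he said the quick brown fox ran away"
    run = longest_shared_run(mine, model)
    print(len(run), " ".join(run))
```

A run of four or five words proves little; a run of several dozen is much harder to explain as words that naturally follow one another.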