by faktory 1228 days ago
ChatGPT isn't doing the scraping, humans are. And humans are using computers to both read the article and create content or to scrape it.

So no, it's not a false equivalence.


There’s a reason scraping is a legally grey area.

> Web scraping is legal, US appeals court reaffirms

First, the case is not closed. [0]

Second, to draw an analogy, you can use scraping in the same way you can use a computer: for legal purposes. That is, you cannot use scraping to violate copyright, just as you cannot use a computer to violate copyright.

The following being my conjecture (IANAL), there is fair use and there is copyright violation, and scraping can be used for either—it does not automatically make you a criminal, but neither is it automatically OK. If what you do is demonstrably fair use presumably you’d be fine; but OpenAI with its products cannot prove fair use in principle (and arguably the use stops being fair already at the point where it compiles works with intent to profit).

[0] https://news.ycombinator.com/item?id=31079231

It seems the issue with scraping as it pertains to copyright isn't the scraping itself, any more than someone buying a book to sell cheap photocopies of it indicates a problem with buying books. The issue is the copying and, more importantly, the distribution of those copies.

Fair use of course being the exception.

Now, accessing things like credentials that get left in unsecured AWS buckets is the bigger area where courts are less likely to recognize the legality of scraping. Never mind the fact that these people literally published their private data on a globally accessible platform in a public fashion. I'm not a lawyer but I've seen reports of this leaning both directions in court, and yes, I've seen wget listed as a "hacker tool."

This is what happens when feelings matter more to the legal system than principles.

And before it's brought up, I may as well point out that no, I don't condone the actual USE of obviously private credentials found in an AWS bucket any more than I condone the use of a credit card that one may find on the sidewalk. Both are clearly in the public sphere, unprotected, but for both there is a pretty good expectation that someone put it there by accident, and that it's not YOUR credential to use.

Basically, getting back to the OP, ChatGPT hasn't done anything I've seen that'd constitute copyright infringement -- fair use seems to apply fairly well. As for the ad-supported model, adblockers did this all first. If you wanted to stop anything accessing your site that didn't view ads, there are solutions out there to achieve this. Don't be surprised when it chases away a good amount of traffic though -- you're likely serving up ad-supported content because it's not content you expected your users to pay for to begin with.

Yes but that's a technical issue. I took the parent as making a philosophical point and responded in that spirit.
Wouldn’t it be nice if the people on these forums were not ignorant of both philosophy and the legal system before diving into incoherent conversations about both at the same time, where the main thrust is the emotions they have about these tools?
One can dream.
yup
How is it not scraping? There's no other way to get all that data for training a model without scraping.
It's scraping both when humans do it and when the ChatGPT team does it, but that wasn't the point the parent made. He made a moral/philosophical point, which is what I responded to.