| HN Mirror


	by Gigachad 40 days ago
	It's because they want to restrict AI companies from stealing content, but they can't do it if the Internet Archive proxies it all for them. All of the LLMs would be massively less useful if it weren't for scraping the latest news.

3 comments

stephen_g 40 days ago

LLMs have other ways of accessing the content, they don’t need the Web Archive.

Every LLM company can afford to spin up a new subscriber account every day, proxy through different IPs from all sorts of ASNs, crawl until the account gets banned, and then do it again, and again, and again.

overfeed 40 days ago

> LLMs have other ways of accessing the content, they don’t need the Web Archive.

What's the conclusion from this train of thought? Just because some burglars can pick locks doesn't mean you should leave your front door unlocked.

Locking a door (or robots.txt) is how one can establish mens rea for those who bypass the barrier.

AnthonyMouse 40 days ago

This is like arguing that services can't provide access to libraries that provide public WiFi because it would give the public legal permission to pirate TV shows. They're two unrelated things. And then some members of the public argue that they're making fair use rather than pirating anything, but that still has nothing to do with the library.

stephen_g 40 days ago

But as I understand it, the Web Archive does respect robots.txt, while LLM scrapers absolutely do not and use all sorts of dodgy methods to get around it already...

The actual root cause is that we're allowing LLM companies to completely disregard copyright laws for their profit. Whether the LLM companies scrape the Web Archive or the original source doesn't change the copyright infringement implications in any way, and cutting off the web archive doesn't practically change anything (because as I understand, LLM scraping is already prolific all over the web).

pseudalopex 29 days ago

Internet Archive do not respect robots.txt now. Or not consistently.

Gigachad 40 days ago

The legal implications would be different vs scraping publicly available content.

AnthonyMouse 40 days ago

Is there a case that actually says this? Why would whether something is fair use depend on that? For that matter, how would they even show that a given AI model was trained on something from a recursive crawler rather than the same articles added to the training data after being downloaded by hand?

Gigachad 40 days ago

There was a similar case where a web scraper was bypassing prevention mechanisms on LinkedIn:

https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn

fragmede 39 days ago

That case is why Twitter, and anyone else with lawyers paying attention, went and put content behind a login wall.

AnthonyMouse 39 days ago

Twitter griefs everyone with a login wall because they want bulk downloaders to pay for API access instead and the login wall is an attempt to rate limit non-API bulk requests.

That isn't relevant to ordinary media outlets because a) they don't have enough content volume for rate limiting to be effective since it's possible to get everything they publish even at a slow rate limit, and b) getting AI scrapers to subscribe to their bulk download API instead is not the objective in their case.

AnthonyMouse 40 days ago

That case seems to imply the opposite?

switzer 40 days ago

LLMs would then license content from news orgs and other publishers, which is what should happen.

userbinator 40 days ago

"stealing" is BS because the original still exists. Copyright infringement is more correct.

Gigachad 40 days ago

You can call it whatever you want but it’s killing journalism when LLMs can automatically scrape and reword all the news. Sucking up the profits without contributing anything back to the people who created the work.

NeutralCrane 39 days ago

I don’t think many people are getting daily news from LLMs. Journalism has been dying since long before LLMs burst onto the scene as well.

There really isn’t even a defensible argument as to why this should be illegal. The idea that someone can read words about a concept and then reword an explanation of that concept, and that doing so somehow violates the rights of the original author, is absurd.

The issue here and elsewhere isn’t LLMs. It’s that IP as a concept has always been a dystopic farce. Despite this we have not only kicked the can down the road on addressing this, we’ve doubled and tripled down and built our society around the concept. The advent of AI has simply blown the scale of the problem up to the point where it cannot be ignored any longer.

fragmede 39 days ago

> I don’t think many people are getting daily news from LLMs.

How many people do you think use LLMs in some fashion at all in their daily lives? Genuine question. I'm sure my personal experience is a biased sample, but so is everyone else's. Stats from AI companies aren't going to be (seen as) objective either. OpenAI and Anthropic are pushing a feature where I get a situation report at 9am like I'm an important official. With both labs pushing that, I think some people are getting their daily news from LLMs. The question is how many it would take to be meaningful, and how would we know if/when that bar gets crossed? What are the implications of that?

AnthonyMouse 40 days ago

The general problem here is that as soon as something is news, there will be not only numerous articles about it from multiple publications but also discussion of it on social media.

Which means LLMs have a zillion sources to get the story. Removing any given subset isn't going to prevent it from having the information in the training data; all it does is prevent that subset from being archived for future humans.

jasonfarnon 40 days ago

they're stealing page views
