HN Mirror



	by bee_rider 389 days ago
	“Not possible” = “a business-destroying level of honesty”?

2 comments

rcxdude 389 days ago

Even if training on the copyrighted material is OK, just providing a data dump of it almost certainly is not.


alpaca128 389 days ago

No need for a data dump, just list the URLs, or whatever else identifies their training data sources. AFAIK that's how the LAION training dataset was published.


anonymoushn 389 days ago

Providing a large list of bitrotted URLs and book titles, which users would have to OCR themselves before attempting to reproduce the model, doesn't seem very useful.


echoangle 389 days ago

Aren't the datasets mostly shared in torrents? They probably won't bitrot for some time.


Wowfunhappy 388 days ago

...no? They also use web crawlers.


bee_rider 388 days ago

The datasets are collected using web crawlers, but that doesn’t tell us anything about how they are stored and re-distributed, right?


tokioyoyo 389 days ago

There is a "keep doing what you're doing, as we would want one of our companies to be on top of the AI race" signal from governments. It could've been stopped, maybe, 5 years ago. But now we're way past that, so nobody cares about these sorts of arguments.
