by 486sx33 616 days ago
It’s unfortunate and kind of dystopian. We have an opportunity to properly archive all of the world's online data and catalog it at historically very low cost, so that the future of our planet will have a much better reference point for the past.

Instead, companies are sucking up as much crap as possible, tokenizing it, scrubbing it, and adding “safety” to it.

Reality is always much stranger than fiction.


We have billions of people; we can accomplish two, maybe three, things at a time. This is a valid use as any of that archived data. The part that sucks isn't that people are doing unusual things with it like training AI, but that copyright & capitalism make it so that everyone has to go get their own data themselves, to the annoyance of web admins.

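The biggest technical hurdle to sharing the work among interested parties is that the web only authenticates the pipe, not the content. TLS tells you which server you talked to, but it gives you no way to prove that a copy of a page handed to you by a third party is what the origin actually served.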
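To make "authenticating the content" concrete, here is a minimal sketch, assuming the origin publishes a SHA-256 digest of each page somewhere trustworthy. That publishing step, and the names `content_digest` and `matches_published_digest`, are hypothetical; in practice the digest would itself need to be signed by the origin for this to prove anything.

```python
import hashlib
import hmac


def content_digest(body: bytes) -> str:
    """Return a hex SHA-256 digest that identifies the content itself."""
    return hashlib.sha256(body).hexdigest()


def matches_published_digest(body: bytes, published_digest: str) -> bool:
    """Check a copy obtained from anyone against a digest the origin published."""
    return hmac.compare_digest(content_digest(body), published_digest)


# Hypothetical page body as the origin served it.
original = b"<html><body>An archived page</body></html>"

# Assumption: the origin publishes (and ideally signs) this value.
published = content_digest(original)

# A copy passed along by a third-party archive, and a modified one.
copy_from_archive = original
tampered_copy = original.replace(b"archived", b"altered")

print(matches_published_digest(copy_from_archive, published))
print(matches_published_digest(tampered_copy, published))
```

With something like this, it would not matter which crawler or archive handed you the bytes, only that they match what the origin vouched for.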

CommonCrawl tries to archive the web and share it openly so everyone doesn't have to scrape it themselves.

"Our goal is to democratize the data so that everyone, not just big companies, can do high-quality research and analysis."

Because they share it openly, including with those doing AI, they wind up on "AI crawler" lists. Those lists are increasingly used by blocking tools that just "use the AI list", by people who don't like AI, or, quite ironically, by people who are trying to prevent the excess traffic that poorly mannered AI crawlers cause. (Common Crawl's crawler is well mannered: it uses a proper user-agent, respects robots.txt including crawl-delay, etc.)

https://commoncrawl.org/
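For anyone wondering what "respects robots.txt including crawl-delay" looks like from the crawler's side, here is a rough sketch using Python's standard `urllib.robotparser`. The robots.txt content and the example.com URLs are made up for illustration, and this is not Common Crawl's actual code; `CCBot` is the user-agent token their crawler announces.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt that a site admin might serve.
robots_txt = """\
User-agent: CCBot
Crawl-delay: 2
Disallow: /private/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
# Parse from memory here; a real crawler would fetch the file from the site.
parser.parse(robots_txt.splitlines())

agent = "CCBot"

# A polite crawler checks every URL before requesting it.
print(parser.can_fetch(agent, "https://example.com/articles/1"))
print(parser.can_fetch(agent, "https://example.com/private/notes"))

# It also waits this many seconds between requests to the same host.
print(parser.crawl_delay(agent))
```

A blocklist that matches on the user-agent alone throws this cooperation away, since the crawler that identifies itself honestly is the easiest one to block.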

Copyright & capitalism are a crucial part of how we got the technical foundation that gave us ML, and most of the material used for training it. Big tech companies that want to monetize it at scale would like us to not think about that (or about any long-term consequences that do not affect shareholder value beyond current management), of course. If anything, the problem with intellectual property law is that they feel it’s safe for them to ignore it when it comes to ordinary people’s work (good luck suing ClosedAI).

> This is a valid use as any of that archived data.

No, it's really not, as most of the people who actually spend the time and effort to produce that content did not consent to it being used to train AI.

> copyright & capitalism

That's a really disingenuous way to say "the creators of that data didn't consent to training or commercial use and I want to steal their effort".

I was actually going for the dynamic where sharing isn't caring in this space. In theory it would be great if there were a few good companies who crawled the internet for you and sold access to it, but in practice those companies are pushed to charge an arm and a leg, which pushes medium and large companies to go get the data themselves.

I don't consent to paying rent, but I still have to. If it's legal for one party it should be legal for all parties. The law shouldn't pick favourites. If ChatGPT (backed by Microsoft) can copy my data, I can download unlicensed Windows. If I can't, it can't.

Yes, I completely agree that the law shouldn't pick favorites.

To clarify: the creators of the majority of online content haven't consented to their content being used to build AI models for any company or organization. For US-based "creators", that includes both domestic companies like Anthropic, OpenAI, and Google and foreign companies like ByteDance.