Hacker News
by TrueDuality 220 days ago
This is a false equivalency, and I'm surprised no one else has brought it up. An archive of a site inherently preserves attribution; scraping and training do not.
2 comments

Is it? I thought it was ridiculous at first, but the more I think of it... both are scenarios where a corporation is scraping billions of webpages. We like the reason archive.is does it, but unless it's some kind of charity, I think it's a reasonable comparison.
archive.is is a charity, no? Or at least they take donations. The legal entity behind it seems nebulous, but they don't have ads and have no paid product or offering.
They sure as shit do have ads. Have you ever accidentally followed a link using a browser profile that has no ad blocking enabled?

I only rarely browse without some form of content blocking (usually privacy-focused... that takes care of enough ads for me, most of the time). I keep a browser profile that's got no customizations at all, though, for verifying that bugs I see/want to report are not related to one of my extensions.

Every once in a while, I'll accidentally open a link to a news site (or to an archive of such a site) in that vanilla profile. I'm shocked at how many ads you see if you don't take some countermeasures.

I just confirmed in that profile: archive.is definitely puts ads around the sites they've archived.

I stand corrected. Maybe I never noticed because I use ad blockers.

And admittedly, I used to think it was the Internet Archive.

It does make this case seem problematic now that I know the details.

So if OpenAI or <AI scraper of the day> adds attribution to their AI-generated answers, everything is OK?
It would be closer to okay.