Hacker News
by ck2 4504 days ago
Putting aside legal issues, you don't have any moral problems with scraping large volumes of content that is not yours?
3 comments

Certainly not! If I were reselling the data, maybe. But I'm generally using it for statistics and data viz. I include the source of the data and I always obey robots.txt. Sometimes I'm even able to talk with the owners beforehand to get their OK.
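
For anyone wondering what "obeying robots.txt" can look like in practice, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The rules text, the `my-bot` user agent, the example.com URLs, and the helper name `allowed` are all made up for illustration; this is not the commenter's actual code.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the robots.txt body as a list of lines,
    # so this sketch needs no network access.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules: everything is allowed except /private/.
EXAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    print(allowed(EXAMPLE_RULES, "my-bot", "https://example.com/public/page.html"))   # True
    print(allowed(EXAMPLE_RULES, "my-bot", "https://example.com/private/data.html"))  # False
```

A real crawler would more likely point the parser at the live file with `set_url()` and `read()`, then call `can_fetch()` before every request.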

(Don't downvote him, it's a valid question)

Now you'll have to tell us about the project...

300TB is quite a lot, even today.

Over time, I've learned to wget every web page and content archive I want to keep. The Internet forgets.
In an earlier age, I ran everything through squid to consolidate browser caches. About five minutes after setting it up, I realised that pulling all the references in the log file and then indexing the lot with htdig would be tremendously useful when I was on the road without internet access.
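
The "pull all the references out of the log file" step could be sketched roughly as below. This assumes Squid's native access.log layout, where the request method is the sixth whitespace-separated field and the URL is the seventh. The sample log lines and the function name `urls_from_squid_log` are invented for illustration, and the comment does not say how it was actually done.

```python
def urls_from_squid_log(lines):
    """Collect unique GET URLs from Squid native-format access log lines."""
    seen = set()
    ordered = []
    for line in lines:
        fields = line.split()
        if len(fields) < 7:
            continue  # skip blank or truncated lines
        method, url = fields[5], fields[6]
        if method != "GET":
            continue  # CONNECT, POST etc. are not pages worth indexing
        if url not in seen:
            seen.add(url)
            ordered.append(url)  # keep first-seen order
    return ordered


# Hypothetical log excerpt in the assumed native format.
SAMPLE_LOG = """\
1066036250.123    120 192.168.0.5 TCP_MISS/200 5120 GET http://example.com/index.html - DIRECT/93.184.216.34 text/html
1066036251.456     15 192.168.0.5 TCP_HIT/200 5120 GET http://example.com/index.html - NONE/- text/html
1066036252.789    300 192.168.0.5 TCP_MISS/200 300 CONNECT example.org:443 - DIRECT/93.184.216.34 -
1066036253.012     80 192.168.0.7 TCP_MISS/200 2048 GET http://example.net/docs/faq.html - DIRECT/93.184.216.35 text/html
"""

if __name__ == "__main__":
    for url in urls_from_squid_log(SAMPLE_LOG.splitlines()):
        print(url)
```

The resulting list could then be handed to whatever fetches and indexes the pages; a custom Squid log format would need different field positions.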

I spent way too much time pruning stupid crap such as Slashdot and started to learn this 'Bayesian classifier' thing.
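
For readers new to the idea, a toy naive Bayes text classifier with add-one (Laplace) smoothing might look like the sketch below. The class name, the `junk`/`keep` labels, and the training snippets are all hypothetical; the comment gives no detail about what was actually built.

```python
import math
from collections import Counter


class NaiveBayes:
    """Tiny multinomial naive Bayes over whitespace-separated words."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocab = set()           # every word seen during training

    def train(self, label, text):
        words = text.lower().split()
        self.word_counts.setdefault(label, Counter()).update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior: fraction of training documents with this label.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(counts.values())
            for word in words:
                # Add-one smoothing so unseen words do not zero out the score.
                score += math.log((counts[word] + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    nb = NaiveBayes()
    nb.train("junk", "slashdot comments flame war")
    nb.train("junk", "slashdot poll troll")
    nb.train("keep", "squid cache configuration guide")
    nb.train("keep", "htdig index configuration")
    print(nb.classify("slashdot troll thread"))  # junk
    print(nb.classify("squid configuration"))    # keep
```

Working in log space avoids underflow on long pages, and the smoothing term keeps a single unfamiliar word from ruling a label out entirely.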

Your idea is much better.

That's personal use, I have no problem with that. The above project sounds commercial in nature.
That seems pretty presumptuous...
Why should he? It's publicly available information.